ADA038699




Acoustic  signal  processing  based  on  the  short  time  spectrum 


by 




Michael  Wayne  Callahan 




March 1976



UTEC-CSc-76-209


This research was supported by the Advanced Research
Projects Agency of the Department of Defense under Contract
No. DAHC15-73-C-0363.



TABLE OF CONTENTS


ABSTRACT

Chapter 1  INTRODUCTION
    1.1  Short-Time Frequency Analysis
    1.2  Time/Frequency Analysis in the Human Auditory System
    1.3  Contribution of this Research

Chapter 2  THEORY
    2.1  The Fourier Transform of Discrete Signals
    2.2  The Short-Time Fourier Transform
    2.3  Filter Bank Analogy to the Short-Time Fourier Transform
    2.4  Modification of the Short-Time Fourier Transform
    2.5  Selection of a Window
    2.6  Information in the Magnitude and Phase
         of the Short-Time Fourier Transform

Chapter 3  REMOVAL OF BROAD BAND BACKGROUND NOISE
    3.1  Local Wiener Filtering
    3.2  Thresholding

Chapter 4  ISOLATION OF PERCEPTUALLY IMPORTANT SPEECH FEATURES

Chapter 5  TWO DIMENSIONAL COMPRESSION AND EXPANSION
    5.1  Review of Homomorphic Compression and Expansion
    5.2  Two Dimensional Compression and Expansion
    5.3  Comparison of Two Dimensional and Homomorphic Compression
    5.4  Experimental Results

Chapter 6  REMOVAL OF LOCALLY PERIODIC INTERFERING SIGNALS
    6.1  Removal by Spectrum Estimation
    6.2  Removal by Two Dimensional Filtering
    6.3  Experimental Results

Chapter 7  CONCLUSIONS

Appendix A  A LIMIT ON THE UNCERTAINTY PRODUCT
            FOR THE SHORT-TIME SPECTRUM

Appendix B  EXPERIMENTAL METHODS
    B.1  Computational Information
    B.2  Recording and Playback of Signals
    B.3  Spectrogram Displays

REFERENCES

ACKNOWLEDGEMENTS

FORM DD 1473


ABSTRACT 


The  frequency  domain  representation  of  a time  signal  afforded 
by  the  Fourier  transform  is  a powerful  tool  in  acoustic  signal 
processing. The usefulness of this representation is rooted in the
mechanisms  of  sound  production  and  perception.  Many  sources  of 
sound  exhibit  normal  modes  or  natural  frequencies  of  vibration,  and 
can  be  described  concisely  in  the  frequency  domain.  The  human 
auditory  system  performs  frequency  analysis  early  in  the  hearing 
process,  so  perception  is  often  best  described  by  frequency  domain 
parameters. 

This  dissertation  investigates  a new  approach  to  acoustic 
signal  processing  based  on  the  short-time  Fourier  transform,  a two 
dimensional representation which shows the time and frequency
structure  of  sounds.  This  representation  is  appropriate  for  signals 
such  as  speech  and  music,  where  the  natural  frequencies  of  the 
source  change  and  timing  of  these  changes  is  important  to 
perception.  The  principal  advantage  of  this  approach  is  that  the 
signal  processing  domain  is  similar  to  the  perceptual  domain,  so 
that  signal  modifications  can  be  related  to  perceptual  criteria. 

The  mathematical  basis  for  this  type  of  processing  is 
developed, and four examples are described: removal of broad band
background noise, isolation of perceptually important speech
features, dynamic range compression and expansion, and removal of
locally  periodic  interfering  signals. 


CHAPTER  1 


INTRODUCTION 


1.1  Short-Time  Frequency  Analysis 

The representation of a signal by its Fourier transform is
central to acoustic signal processing. The usefulness of this
transform derives from the mechanisms of sound production and
perception. Many sources of sound exhibit normal modes or natural
frequencies of vibration, and such phenomena can be described
concisely in the frequency domain. There is clear evidence that the
human auditory system performs frequency analysis, and perception is
often best described by frequency domain parameters.

In signals such as speech and music, the characteristic
frequencies of the source change, and perception depends on the
timing of these changes. For signals such as these, a modification
of the Fourier transform is desired which will show the salient time
and frequency structure of the signal. This result can be obtained
by introducing a window or weighting function which isolates a short
segment of the signal. The window moves along the signal as time
progresses, and Fourier analysis is applied to the portion of the
signal seen through the window. The result is the short-time
Fourier transform [1,2] - a two dimensional, time/frequency
representation of the signal:


$$S(\omega, t) = \int_{-\infty}^{\infty} m(t-x)\, s(x)\, \exp(-i\omega x)\, dx \qquad (1.1)$$


where s(t) is the signal and m(t) is the window. The short-time
Fourier transform is invertible [1], and if m(0) is not zero the
inverse is usually chosen as

$$s(t) = [2\pi\, m(0)]^{-1} \int_{-\infty}^{\infty} S(\omega, t)\, \exp(i\omega t)\, d\omega. \qquad (1.2)$$

A visible representation of the short-time Fourier transform of
speech is shown in Figures 1 and 2. The ordinate in these figures
is frequency (0-5000 Hz), the abscissa is time (1 sec), and the
intensity of the reflected light is proportional to the magnitude of
$S(\omega,t)$. The difference in appearance of the two figures is due to
the window. Figure 1 was made with a relatively short window (10
msec) so that time resolution is good - individual pitch pulses
caused by the vocal cords opening and closing can be seen. Figure 2
was made with a longer window (20 msec) and frequency resolution is
increased - pitch pulses are not visible but the resulting harmonic
structure is. The effect of the window will be discussed in more
detail in the next chapter.

Time/frequency displays like Figures 1 and 2, which are commonly
called spectrograms, are widely used by speech researchers [2,3].
These displays are useful because they show features important to
production and perception of speech. This is illustrated by a
series of experiments performed at Haskins Laboratories [4,5]. In
these experiments, a mechanical system was used to play back
stylized, hand-painted spectrograms of closely related sounds, such
as syllables beginning with p, t, and k. It was found that
perceptual distinctions among these sounds were explained by
differences in their time/frequency structure such as the peak
frequency of the noise burst and the direction of subsequent
formant* transitions. These features must be accurately reproduced
by systems for synthesis or transmission of speech, but they are not
readily apparent in the speech waveform or its conventional Fourier
transform.

Because it models speech production and perception, the
short-time Fourier transform has been used in an efficient method of
speech encoding. The phase vocoder [6] performs analysis based on
1.1 (sampled in frequency), and transmits bandlimited signals
corresponding to the magnitude and phase of the short-time
transform. Speech is reconstructed at the receiver using 1.2. In a
more general sense, nearly all channel vocoders are based on
short-time frequency analysis [7].


1.2  Time/Frequency Analysis in the Human Auditory System

Helmholtz [8] first proposed a theory of hearing based on
frequency analysis of acoustic stimuli. This feature of hearing has
been observed in numerous psychophysical experiments [9,10], and was
put on a firm physiological basis by von Bekesy [11] and Kiang [12].

A schematic diagram of the peripheral auditory system is shown
in Figure 3. Sound pressure waves cause vibration of the eardrum or
tympanic membrane. These vibrations are transmitted by the
ossicular bones of the middle ear, whose main function is impedance
transformation between the air medium in the outer ear and the fluid
medium of the cochlea. The cochlea is a slender, fluid filled tube,
divided into two chambers by the basilar membrane. It is actually
coiled like a snail shell (hence its name), but is shown unrolled
for clarity in Figure 3. The vibrations transmitted to the cochlear
fluid cause motion of the basilar membrane, and this motion is
sensed by the auditory nerve.

*Formants are the prominent spectral peaks in speech caused by
resonances in the vocal tract.

Von Bekesy [11] showed that acoustic stimuli set up traveling
waves  on  the  basilar  membrane.  The  mechanical  properties  of  the 
membrane  are  analogous  to  those  of  a non-uniform  transmission  line: 
waves  of  a given  frequency  travel  along  the  membrane  until  they 
reach  the  point  resonant  at  that  frequency,  and  then  are  rapidly 
attenuated.  Displacement  of  the  membrane  is  greatest  at  the  point 
of  resonance.  As  a result,  the  response  of  the  basilar  membrane  at 
a specific  point  along  its  length  is  much  like  that  of  a relatively 
broad  bandpass  filter.  The  bandwidth  of  successive  points  is 
roughly  constant  Q,  so  frequency  resolution  is  best  at  low 
frequencies  and  time  resolution  best  at  high  frequencies. 

The  auditory  nerve  terminates  in  hair  cells  along  the  length  of 
the  basilar  membrane.  The  exact  mechanism  of  the  mechanical  to 
neural conversion is not well understood, but Kiang [12] has shown
that the firing rate of each neuron is characterized by a response
curve similar to that of the basilar membrane. The frequency
analysis of the basilar membrane is preserved in the auditory nerve.

The  pattern  of  nerve  firings  available  to  higher  centers  of 
hearing  is  thus  a two  dimensional,  time/frequency  representation  of 
the  acoustic  stimulus,  similar  to  the  short-time  spectrum.  The 
effective  bandwidth  of  the  auditory  system  based  on  psychophysical 
data is about 100 Hz for low frequency stimuli (100-350 Hz), and
increases to about 1000 Hz for 5000 Hz stimuli [2,9].

1.3  Contribution  of  This  Research 

This  paper  investigates  acoustic  signal  processing  based  on  the 
short-time spectrum. The short-time Fourier transform of the signal
is calculated as in 1.1, and the magnitude of the transformed signal
is modified in an appropriate way. A new signal is then obtained
from the modified transform using 1.2.

This approach is suggested by time/frequency analysis in the
auditory system, and is appropriate for systems where subjective
perception of the output signal is the primary measure of
performance. The short-time Fourier transform is used to provide a
time/frequency representation because it is computationally
efficient. It does not model auditory system processing in detail -
for example, frequency analysis in the auditory system is roughly
constant Q, while the short-time Fourier transform is constant
bandwidth. However, the simplifications inherent in use of this
transform do not unduly limit its effectiveness in predicting
perception for the experiments described here.

This  approach  makes  use  of  spectrograms  as  an  aid  for  system 
design.  The  source  of  perceptual  problems  can  often  be  recognized 
by  looking  at  a spectrogram  of  the  signal.  For  example,  one  can 
make  a fairly  good  visual  separation  of  signal  and  background  noise 
in  the  spectrogram  of  a noisy  recording  shown  in  Chapter  3.  The 
regions  of  the  spectrogram  where  noise  is  visually  obvious  are 
generally  those  most  noticeable  in  listening.  An  algorithm  which 
selectively  attenuates  portions  of  the  spectrogram  dominated  by 
noise will reduce the noise perceived in the modified signal.

Time/frequency processing results in nonlinear, adaptive
systems whose effect is determined by the local time and frequency
structure of the signal. These systems are inherently more flexible
than linear, time-invariant systems whose design must be based on
average  or  worst  case  signal  characteristics.  The  disadvantage  of 
these  systems  is  their  mathematical  complexity.  Very  few  of  the 
systems  investigated  here  can  be  analyzed  in  a concise  and  complete 
way. 


FIGURE 1


Magnitude of the short-time Fourier transform using a
short (10 msec) window. The picture represents 0.8 sec of
speech, 0 - 5000 Hz. It has been scaled 6 dB/oct above
400 Hz to improve the visibility of high frequency
components. The speech is "open the crate."


FIGURE 2


Magnitude of the short-time Fourier transform using a
long (30 msec) window. The speech and other parameters are
the same as in Figure 1.


FIGURE 3


Schematic drawing of the peripheral auditory system, with the
auditory nerve labeled.


CHAPTER  2 


THEORY 

2.1  The  Fourier  Transform  of  Discrete  Signals 

The Fourier transform for discrete (sampled) signals is similar
to the Fourier series representation of periodic signals, with the
role of the time domain and the frequency domain reversed. The time
domain signal is discrete, so the frequency domain signal is
periodic. The period of the Fourier transform is inversely
proportional to the sampling period - this causes the phenomenon of
frequency domain aliasing, where successive periods of the Fourier
transform overlap if the time signal is not sampled at a
sufficiently high rate [13].

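If f(n) is a time sequence satisfying

$$\sum_{n=-\infty}^{\infty} |f(n)| < \infty, \qquad (2.1)$$

the Fourier transform of f(n) and its inverse are

$$F(\Omega) = \sum_{n=-\infty}^{\infty} f(n)\, \exp(-i\Omega n), \qquad (2.2)$$

$$f(n) = (2\pi)^{-1} \int_{-\pi}^{\pi} F(\Omega)\, \exp(i\Omega n)\, d\Omega. \qquad (2.3)$$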

$\Omega$ and n are dimensionless. They are related to the radian frequency
$\omega$ and time t by the sampling period T:

$$\Omega = \omega T, \qquad n = t/T. \qquad (2.4)$$

It can be seen from 2.2 that $F(\Omega)$ has period $2\pi$.

Transforming  2.2  and  2.3  leads  to  the  following  two  equations, 
which  will  be  useful  later: 

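$$\delta_{nl} = (2\pi)^{-1} \int_{-\pi}^{\pi} \exp[i\Omega(n-l)]\, d\Omega. \qquad (2.5)$$

$$2\pi\, \delta(\Omega-\Omega') = \sum_{n=-\infty}^{\infty} \exp[i(\Omega-\Omega')n]. \qquad (2.6)$$

$\delta_{nl}$ is the Kronecker delta and $\delta(\Omega-\Omega')$ is the (periodic) Dirac delta
function.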

This  brief  review  is  sufficient  for  the  development  of  the 
short-time  Fourier  transform  in  this  chapter.  The  properties  of  the 
Fourier  transform  of  discrete  signals  are  discussed  in  detail  in  a 
number  of  standard  texts,  such  as  References  13  and  14. 

2.2  The  Short-Time  Fourier  Transform 

The discrete short-time Fourier transform of a sequence s(n) is
defined by

$$S(k,n) = \sum_{r=-\infty}^{\infty} s(r)\, m(n-r)\, \exp(-i\Omega_k r) \qquad (2.7)$$

$$\Omega_k = 2\pi k/N, \qquad k = 0, 1, \ldots, N-1$$

where m(n) is an appropriately chosen window. S(k,n) is a two
dimensional, complex function of frequency (k) and time (n). It can
be  interpreted  as  samples  of  the  Fourier  transform  of  the  portion  of 
the  sequence  s(n)  seen  through  the  sliding  window  m(n).  The  window 
is  of  finite  length,  and  is  normally  chosen  as  the  unit  sample 
response  of  a low  pass  filter.  The  discrete  short-time  spectrum  is 
the  squared  magnitude  of  S(k,n). 

Since the window m(n) is of finite length, 2.7 can be
implemented efficiently using the Fast Fourier Transform algorithm
[15].

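The inverse to 2.7 can be chosen as [1]

$$s(n) = \frac{1}{N} \sum_{k=0}^{N-1} S(k,n)\, \exp(i\Omega_k n). \qquad (2.8)$$

This inverse imposes a mild restriction on the window, which can be
seen by substituting 2.7 into 2.8.

$$s'(n) = \frac{1}{N} \sum_{k=0}^{N-1} \exp(i\Omega_k n) \sum_{r=-\infty}^{\infty} s(r)\, m(n-r)\, \exp(-i\Omega_k r) \qquad (2.9)$$

Since both sums have a finite number of bounded terms, the order of
summation can be reversed.

$$s'(n) = \sum_{r=-\infty}^{\infty} s(r)\, m(n-r)\, \frac{1}{N} \sum_{k=0}^{N-1} \exp[i\Omega_k (n-r)] \qquad (2.10)$$

$$\frac{1}{N} \sum_{k=0}^{N-1} \exp(i\Omega_k l) = \begin{cases} 1, & l = 0, \pm N, \pm 2N, \ldots \\ 0, & \text{otherwise} \end{cases} \qquad (2.11)$$

Now if the inverse transform is to result in the original signal,
the window must satisfy

$$m(n) = \begin{cases} 1, & n = 0 \\ 0, & n = \pm N, \pm 2N, \ldots \end{cases} \qquad (2.12)$$

but is otherwise arbitrary.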

Notice that with the inverse defined by 2.8, only one point of
the time sequence is obtained from each time slice of S(k,n). This
is a desirable characteristic for the experiments discussed later,
in which S(k,n) is modified and 2.8 is used to obtain a new time
signal. Changes to S(k,n) at a given time will be reflected only in
the corresponding samples of the time signal.
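
The analysis and synthesis pair 2.7 and 2.8 can be checked numerically. The sketch below is a direct, unoptimized evaluation in Python and is not the FFT-based implementation used in this work; the function names, the test signal, and the choice N = 8 are illustrative assumptions. It uses a raised cosine window that is nonzero only for -N < n < N, so the restriction 2.12 is satisfied.

```python
import cmath
import math


def raised_cosine_window(N):
    """Window m(d) for -N < d < N, stored as {d: m(d)}; m(0) = 1."""
    return {d: 0.5 + 0.5 * math.cos(math.pi * d / N) for d in range(-N + 1, N)}


def stft_slice(s, m, N, n):
    """Direct evaluation of equation 2.7 at time n, for k = 0, 1, ..., N-1."""
    slice_n = []
    for k in range(N):
        omega_k = 2.0 * math.pi * k / N
        total = 0j
        for d, weight in m.items():
            r = n - d  # m(n - r) is nonzero only where n - r = d
            if 0 <= r < len(s):  # the signal is taken as zero outside its range
                total += s[r] * weight * cmath.exp(-1j * omega_k * r)
        slice_n.append(total)
    return slice_n


def inverse_point(slice_n, N, n):
    """Equation 2.8: one time sample from one time slice of S(k,n)."""
    acc = 0j
    for k in range(N):
        acc += slice_n[k] * cmath.exp(1j * 2.0 * math.pi * k * n / N)
    return acc / N


N = 8
m = raised_cosine_window(N)
s = [math.sin(0.7 * n) + 0.3 * math.cos(2.1 * n) for n in range(40)]  # assumed test signal
recovered = [inverse_point(stft_slice(s, m, N, n), N, n) for n in range(len(s))]
max_error = max(abs(recovered[n] - s[n]) for n in range(len(s)))
```

In this sketch `max_error` should be at the level of floating point rounding, since the only window sample that survives the sum over k in 2.10 is m(0) = 1.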

The short-time transform is normally sampled less densely in
time than the original signal, at a rate determined by the frequency
band occupied by the Fourier transform of the window. One might
therefore expect that intermediate samples of S(k,n) would have to
be obtained by interpolation prior to reconstruction of s(n). For
an N point signal and a K point transform desampled in time by a
factor L, this would require (N-N/L)*K interpolations* and N
calculations using 2.8 (each a sum of K terms). Assuming that about
9 multiply and add operations are required for an interpolation, the
time required for reconstruction is

$$t_1 = [(N - N/L) \cdot 9K + NK]\, \tau \qquad (2.13)$$

where $\tau$ is the multiply and add time. The most distressing feature
of 2.13 is the fact that reconstruction time is proportional to the
frequency resolution of the short-time transform.

*S(k,n) is complex, but due to real-even/imaginary-odd symmetry,
only K interpolations are needed for each time slice.

Portnoff [15] has developed a more efficient algorithm to
accomplish the interpolation and inverse transform which makes use
of the Fast Fourier Transform algorithm. Portnoff's method, which
is explained in detail in Reference 15, reverses the order in which
the interpolation sum and the sum required by 2.8 are performed.
This results in a sum which is similar to 2.8, but has the form of a
discrete Fourier transform of S(k,n) in the k direction. A new
array can therefore be obtained from S(k,n) using the Fast Fourier
Transform, and points in the time sequence are obtained by
interpolating selected data in the transformed array. Only one
interpolation is required for each intermediate point in the time
sequence. The Fast Fourier Transform requires a time proportional
to $K \log_2(K)$* to transform a K point sequence [13]. The constant of
proportionality is typically about three times the multiply and add
time, so the time required for reconstruction using the Portnoff
method is

$$t_2 = [(N/L) \cdot 3K \log_2(K) + (N - N/L) \cdot 9]\, \tau. \qquad (2.14)$$

For a given window shape, K/L is a constant - scaling the window
length changes the desampling ratio by the same factor.
Reconstruction time therefore grows only as the logarithm of the
frequency resolution of S(k,n).

To appreciate the time savings afforded by Portnoff's method,
consider a case where K = 512 and L = 64 (typical for some of the
experiments described later). Neglecting N/L with respect to N

$$t_1 \approx 10K \cdot N\tau = 5120 \cdot N\tau,$$

$$t_2 \approx (24 \log_2(K) + 9)\, N\tau = 225 \cdot N\tau,$$

$$t_1/t_2 \approx 23.$$
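
The arithmetic in this comparison can be reproduced with a few lines of Python. This is only a sketch of the operation counts in 2.13 and 2.14; the function names and the signal length N = 10000 are assumptions made for illustration, and the counts are expressed in units of the multiply and add time.

```python
import math


def direct_cost(N, K, L):
    """Operation count of equation 2.13, in units of the multiply and add time."""
    return (N - N / L) * 9 * K + N * K


def portnoff_cost(N, K, L):
    """Operation count of equation 2.14, in units of the multiply and add time."""
    return (N / L) * 3 * K * math.log2(K) + (N - N / L) * 9


K, L = 512, 64
N = 10000  # assumed signal length; the ratio depends on it only through N/L

# Ratio with the N/L terms kept.
exact_ratio = direct_cost(N, K, L) / portnoff_cost(N, K, L)

# Ratio with N/L neglected with respect to N, as in the text: 5120 / 225.
approx_ratio = (10 * K) / ((3 * K / L) * math.log2(K) + 9)
```

Under these assumptions both ratios fall between 22 and 23, consistent with the rounded figure quoted above.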

*This assumes that K is a power of 2, which was always the case in
the work described here. The Fast Fourier Transform will provide
significant time savings whenever K is highly composite [13].

2.3  Filter Bank Analogy to the Short-Time Fourier Transform

Equations 2.7 and 2.8 can be interpreted in another way which
may provide additional insight concerning the short-time Fourier
transform. Consider the system shown in Figure 4. The N bandpass
filters $BP_0, BP_1, \ldots, BP_{N-1}$ have unit sample responses

$$b_k(n) = m(n)\, \exp(i\Omega_k n), \qquad k = 0, 1, \ldots, N-1 \qquad (2.15)$$

where m(n) is the unit sample response of a low pass filter. The
filter outputs are therefore

$$y_k(n) = \sum_{r=-\infty}^{\infty} b_k(n-r)\, s(r) \qquad (2.16)$$

$$= \sum_{r=-\infty}^{\infty} m(n-r)\, s(r)\, \exp[i\Omega_k (n-r)]. \qquad (2.17)$$

The outputs are then multiplied by $\exp(-i\Omega_k n)$, so that all signals
will be low pass. This process is often referred to as complex
demodulation.

$$z_k(n) = y_k(n)\, \exp(-i\Omega_k n) \qquad (2.18)$$

$$= \sum_{r=-\infty}^{\infty} m(n-r)\, s(r)\, \exp(-i\Omega_k r) \qquad (2.19)$$

Comparison of 2.19 and 2.7 shows that S(k,n) is equivalent to the
demodulated output of a set of bandpass filters. Multiplication by
$\exp(-i\Omega_k n)$ does not affect the magnitude, so if we are only
interested in |S(k,n)| demodulation is not required.

$$|S(k,n)| = |y_k(n)| = |z_k(n)| \qquad (2.20)$$

Reconstruction of a time signal is accomplished by reintroducing the
center frequency of each filter and summing. If the output is
scaled by 1/N,

$$y(n) = \frac{1}{N} \sum_{k=0}^{N-1} z_k(n)\, \exp(i\Omega_k n) \qquad (2.21)$$

$$= \frac{1}{N} \sum_{k=0}^{N-1} y_k(n), \qquad (2.22)$$

the system is the identity system and reproduces s(n) if m(n)
satisfies 2.12, as before.

The  equivalence  of  analysis  with  2.7  and  synthesis  with  2.8  to 
bandpass  filtering  and  summing  will  be  helpful  in  understanding  the 
experiments  discussed  later,  where  the  magnitude  of  S(k,n)  is 
modified prior to reconstruction of a time signal.
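
The filter bank equivalence can also be verified numerically. The following Python sketch evaluates the bandpass outputs of 2.16 by direct convolution and compares them with S(k,n) computed from 2.7. The helper names, the test signal, and the values N = 8 and a raised cosine window on -N < n < N are assumptions chosen only to keep the example small.

```python
import cmath
import math


def raised_cosine_window(N):
    """Window m(d) for -N < d < N, stored as {d: m(d)}."""
    return {d: 0.5 + 0.5 * math.cos(math.pi * d / N) for d in range(-N + 1, N)}


def stft_point(s, m, N, k, n):
    """S(k,n) from equation 2.7, with the signal taken as zero outside its range."""
    omega_k = 2.0 * math.pi * k / N
    return sum(s[n - d] * w * cmath.exp(-1j * omega_k * (n - d))
               for d, w in m.items() if 0 <= n - d < len(s))


def bandpass_output(s, m, N, k, n):
    """y_k(n) from 2.16: s convolved with b_k(d) = m(d) exp(i Omega_k d)."""
    omega_k = 2.0 * math.pi * k / N
    return sum(w * cmath.exp(1j * omega_k * d) * s[n - d]
               for d, w in m.items() if 0 <= n - d < len(s))


N = 8
m = raised_cosine_window(N)
s = [math.cos(0.9 * n) + 0.5 * math.sin(0.2 * n) for n in range(32)]  # assumed signal

demod_error = 0.0      # largest |z_k(n) - S(k,n)|, equations 2.18 and 2.19
magnitude_error = 0.0  # largest | |y_k(n)| - |S(k,n)| |, equation 2.20
sum_error = 0.0        # largest |y(n) - s(n)|, equation 2.22
for n in range(len(s)):
    y = [bandpass_output(s, m, N, k, n) for k in range(N)]
    for k in range(N):
        S_kn = stft_point(s, m, N, k, n)
        z_kn = y[k] * cmath.exp(-1j * 2.0 * math.pi * k * n / N)
        demod_error = max(demod_error, abs(z_kn - S_kn))
        magnitude_error = max(magnitude_error, abs(abs(y[k]) - abs(S_kn)))
    sum_error = max(sum_error, abs(sum(y) / N - s[n]))
```

All three error measures should be at rounding level, which is the numerical counterpart of 2.19, 2.20, and 2.22 for a window satisfying 2.12.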

2.4  Modification  of  the  Short-Time  Fourier  Transform 

Chapters 3 through 6 discuss experiments which are based on
modifying the short-time Fourier transform. Three operations are
involved: multiplication of S(k,n) by a function G(k,n) (the local
Wiener filter or a threshold function), convolution of |S(k,n)| with
a specified function, and convolution of log|S(k,n)| with a
specified function. In each case the resulting modified transform
is converted to a time signal using 2.8, and this new time signal is
reanalyzed - at least by the auditory system. It would be nice to
express the effect of each of these operations in a mathematically
concise way, and show how the second short-time transform is related
to the original. However, only multiplication is mathematically
tractable. Fortunately, the results for multiplication provide some
insight into the effect of the other operations. In particular,
these results show some limitations on the modifications which can
be accomplished.

Consider first multiplication of the short-time transform
by a function of frequency only. This represents the limiting case
where the spectrum of the input signal changes very slowly. A new
time signal is reconstructed with 2.8.

$$s'(n) = \frac{1}{N} \sum_{k=0}^{N-1} G(k)\, S(k,n)\, \exp(i\Omega_k n) \qquad (2.23)$$

$$= \frac{1}{N} \sum_{k=0}^{N-1} \sum_{r=-\infty}^{\infty} G(k)\, m(n-r)\, s(r)\, \exp[i\Omega_k (n-r)] \qquad (2.24)$$

The order of the sums can be reversed, since both are finite and
bounded. We define a new function

$$g(l) = \frac{1}{N} \sum_{k=0}^{N-1} G(k)\, \exp(i\Omega_k l) \qquad (2.25)$$

and in terms of g(l), 2.24 becomes

$$s'(n) = \sum_{r=-\infty}^{\infty} g(n-r)\, m(n-r)\, s(r). \qquad (2.26)$$

Equation 2.26 is a convolution sum, so the original signal has in
effect been passed through a filter with unit sample response

$$h(n) = m(n)\, g(n). \qquad (2.27)$$

The frequency response of this filter is the Fourier transform of
h(n).

$$H(\Omega) = \sum_{n=-\infty}^{\infty} m(n)\, g(n)\, \exp(-i\Omega n) \qquad (2.28)$$

$$= \frac{1}{N} \sum_{k=0}^{N-1} \sum_{n=-\infty}^{\infty} m(n)\, G(k)\, \exp[-i(\Omega - \Omega_k) n] \qquad (2.29)$$

$$= \frac{1}{N} \sum_{k=0}^{N-1} G(k)\, M(\Omega - \Omega_k) \qquad (2.30)$$

Equation 2.30 shows that the intended modification is smoothed by
the Fourier transform of the window, $M(\Omega)$. The equivalent filter
cannot have detail finer than the bandwidth of $M(\Omega)$. This is a
manifestation of the time-frequency uncertainty principle discussed
in the next section.

Equation 2.30 can be interpreted in terms of the filter bank
analogy of Figure 4. G(k) acts like a gain setting for each of the
N filters - the frequency response of the kth filter, $M(\Omega - \Omega_k)$, is
multiplied by G(k). The composite frequency response of the filter
bank is the sum of the N individual frequency responses.

Equation 2.25 shows that g(n) is periodic, with period N. The
unit sample response desired is the principal period of g(n),
$0 \le n \le N-1$. When the window is longer than N, 2.27 shows that
points from the second period of g(n) are included in h(n). This
will produce audible effects in the output signal which are most
pronounced when G(k) is not smooth. For this reason, the window
length should not exceed the number of frequency samples in S(k,n).
This restriction is not limited to multiplication, but applies
whenever S(k,n) is modified.
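
The equivalence between multiplying S(k,n) by G(k) and filtering s(n) with h(n) = m(n)g(n) can be confirmed with a small numerical experiment. The Python sketch below is illustrative only: the gain function G(k), the test signal, and the sizes (N = 16 frequency samples, a raised cosine window of 15 points) are assumptions, and the transform is evaluated directly rather than with the Fast Fourier Transform.

```python
import cmath
import math

N = 16    # number of frequency samples in S(k,n)
HALF = 8  # window is nonzero for -HALF < d < HALF, so its length (15) is below N
m = {d: 0.5 + 0.5 * math.cos(math.pi * d / HALF) for d in range(-HALF + 1, HALF)}
G = [1.0 if min(k, N - k) <= 3 else 0.25 for k in range(N)]  # assumed channel gains
s = [math.sin(0.5 * n) + math.sin(2.0 * n) for n in range(48)]  # assumed test signal


def omega(k):
    return 2.0 * math.pi * k / N


def stft_point(k, n):
    """S(k,n) from equation 2.7."""
    return sum(s[n - d] * w * cmath.exp(-1j * omega(k) * (n - d))
               for d, w in m.items() if 0 <= n - d < len(s))


def modified_sample(n):
    """Equation 2.23: resynthesis from G(k) S(k,n)."""
    return sum(G[k] * stft_point(k, n) * cmath.exp(1j * omega(k) * n)
               for k in range(N)) / N


def g(l):
    """Equation 2.25."""
    return sum(G[k] * cmath.exp(1j * omega(k) * l) for k in range(N)) / N


h = {d: w * g(d) for d, w in m.items()}  # equation 2.27


def filtered_sample(n):
    """Equation 2.26 written as a convolution of s with h."""
    return sum(h[d] * s[n - d] for d in h if 0 <= n - d < len(s))


max_difference = max(abs(modified_sample(n) - filtered_sample(n))
                     for n in range(len(s)))
g_period_error = max(abs(g(l) - g(l + N)) for l in range(N))
```

Here `max_difference` compares the two routes to s'(n) and `g_period_error` checks the period N of g(n); both should be at rounding level. Lengthening the window beyond N would not break the equality, but h(n) would then contain samples from more than one period of g(n), which is the situation the restriction above is meant to avoid.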

Next  we  construct  the  short-time  transform  of  the  modified 
signal.  It  is  convenient  to  use  the  filter  bank  analogy  of  Figure 
4. 


$$y'_k(n) = b_k(n) \circledast s'(n) \qquad (2.31)$$

$$= b_k(n) \circledast h(n) \circledast s(n) \qquad (2.32)$$

$$= y_k(n) \circledast h(n) \qquad (2.33)$$

where the symbol $\circledast$ represents convolution. Equation 2.33 can be
written in terms of the short-time transform using 2.18.

$$S'(k,n) = \sum_{r=-\infty}^{\infty} S(k,n-r)\, h(r)\, \exp(-i\Omega_k r) \qquad (2.34)$$

The short-time spectrum has been convolved in the time direction
with h(n); the term $\exp(-i\Omega_k r)$ compensates for the demodulation of
S(k,n). If G(k) and thus $H(\Omega)$ are approximately constant over
frequency ranges comparable to the bandwidth of $M(\Omega)$, synthesis and
reanalysis will be approximately an identity operation.

$$S'(k,n) \approx G(k)\, S(k,n) \qquad (2.35)$$

Next, consider multiplication of S(k,n) by a function of time
and frequency. In this case, 2.26 becomes a superposition sum

$$s'(n) = \sum_{r=-\infty}^{\infty} g(n-r, n)\, m(n-r)\, s(r) \qquad (2.36)$$

where

$$g(l,n) = \frac{1}{N} \sum_{k=0}^{N-1} G(k,n)\, \exp(i\Omega_k l). \qquad (2.37)$$

Each sample of s'(n) is now computed from a new unit sample
response. The intended modification G(k,n) is smoothed in frequency
by $M(\Omega)$, as before.

Defining h(l,n) = g(l,n) m(l), the short-time transform of the
modified signal is

$$S'(k,n) = \sum_{r=-\infty}^{\infty} \sum_{x=-\infty}^{\infty} m(n-r)\, h(r-x, r)\, s(x)\, \exp(-i\Omega_k r). \qquad (2.38)$$

We cannot proceed as in 2.31 - 2.34 because of the time dependence
of h(l,n). The physical reason can be seen by considering the
filter bank analogy. Multiplication by G(k,n) is equivalent to a
time-varying gain for each filter. The output of each filter is
amplitude modulated, which broadens the signal spectrum so that it
will not "fit" in the corresponding filter when analyzed the second
time. However if h(l,n), viewed as a function of n, is
approximately constant over the length of the window (i.e., the unit
sample response of the equivalent filter changes slowly)

$$m(n-r)\, h(r-x, r) \approx m(n-r)\, h(r-x, n).$$

Equation 2.38 then becomes

$$S'(k,n) \approx \sum_{r=-\infty}^{\infty} \sum_{x=-\infty}^{\infty} m(n-r)\, h(r-x, n)\, s(x)\, \exp(-i\Omega_k r). \qquad (2.39)$$

Substituting l = r-x and using 2.7

$$S'(k,n) \approx \sum_{l=-\infty}^{\infty} S(k,n-l)\, h(l,n)\, \exp(-i\Omega_k l). \qquad (2.40)$$

Equation 2.40 is the time dependent result analogous to 2.34.

In order for analysis and synthesis to be an approximate
identity, modifications to the short-time transform must change
slowly in time, compared to the window, and slowly in frequency,
compared to the Fourier transform of the window. This restriction
applies whether the modifications are produced by multiplication or
some other operation on the short-time transform.

In the case where the modified signal is reanalyzed by the
auditory system, the interpretation of 2.40 is slightly different.
S(k,n) and S'(k,n) represent signals in the auditory system, and not
the short-time transform used to produce h(l,n). Assuming that
h(l,n) represents the desired modification with sufficient accuracy,
it is only necessary that h(l,n) change slowly compared to the
effective auditory system window for the modification to be
perceived correctly. The auditory system is relatively broad band
compared to the short-time transforms used for the experiments
described later, so its effective window is shorter [2].

2.5  Selection of a Window

The  short-time  Fourier  transform  is  really  a family  of 

transforms,  corresponding  to  all  the  possible  window  functions  which 
can  be  used  in  2.7.  Members  of  this  family  can  be  quite  different, 
as is shown by Figure 5. The same input speech was used for all
three spectrograms. Figure 5(a) was made with a Fourier window of
length (2N-1) of 25.6 msec.

$$m(n) = \begin{cases} 1, & -N < n < N \\ 0, & \text{otherwise} \end{cases} \qquad (2.41)$$

Figure 5(b) was made with a raised cosine or hanning window of the
same length.

$$m(n) = \begin{cases} 0.5 + 0.5 \cos(\pi n/N), & -N < n < N \\ 0, & \text{otherwise} \end{cases} \qquad (2.42)$$

Figure 5(c) was made with a hanning window 12.8 msec long. The
differences in the appearance of the three spectrograms are due to
the shape and length of the window functions and their Fourier
transforms. It is obvious from Figure 5 that the effect of a
process based on the short-time spectrum will depend on the window.
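
The two window shapes in 2.41 and 2.42 are easy to compare numerically. The Python sketch below defines both windows and evaluates the magnitude of their Fourier transforms, relative to the value at zero frequency, at one frequency outside the main lobe. The function names, the choice N = 32, and the probe frequency are assumptions for illustration and are not the parameters used for Figure 5.

```python
import cmath
import math


def fourier_window(n, N):
    """Equation 2.41: rectangular window, 1 for -N < n < N and 0 otherwise."""
    return 1.0 if -N < n < N else 0.0


def hanning_window(n, N):
    """Equation 2.42: raised cosine window on -N < n < N and 0 otherwise."""
    return 0.5 + 0.5 * math.cos(math.pi * n / N) if -N < n < N else 0.0


def relative_response(window, N, omega):
    """|M(omega)| / |M(0)|, where M is the Fourier transform (2.2) of the window."""
    M = sum(window(n, N) * cmath.exp(-1j * omega * n) for n in range(-N + 1, N))
    M0 = sum(window(n, N) for n in range(-N + 1, N))
    return abs(M) / abs(M0)


N = 32
probe = 2.5 * math.pi / N  # assumed probe frequency, beyond the main lobe of both windows
rect_leak = relative_response(fourier_window, N, probe)
hann_leak = relative_response(hanning_window, N, probe)
```

With these values the hanning window's response at the probe frequency is several times smaller than that of the rectangular window, at the cost of a main lobe roughly twice as wide. Both windows equal 1 at n = 0 and vanish at n = N, so they satisfy 2.12 when N is also the number of frequency samples.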

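The short-time Fourier transform cannot have arbitrarily good
time and frequency resolution. The time and frequency resolution
are determined by the window (see Appendix A) and are limited by the
well known Heisenberg uncertainty principle [16], which relates the
temporal extent of a function f(t) and the frequency spread of its
Fourier transform $F(\omega)$.

$$(\Delta t)^2 = \frac{\int_{-\infty}^{\infty} (t - t_0)^2\, |f(t)|^2\, dt}{\int_{-\infty}^{\infty} |f(t)|^2\, dt} \qquad (2.43)$$

$$(\Delta \omega)^2 = \frac{\int_{-\infty}^{\infty} (\omega - \omega_0)^2\, |F(\omega)|^2\, d\omega}{\int_{-\infty}^{\infty} |F(\omega)|^2\, d\omega} \qquad (2.44)$$

$$\Delta t\, \Delta \omega \ge 0.5 \qquad (2.45)$$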
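
The bound 2.45 can be illustrated by evaluating 2.43 and 2.44 numerically. The Python sketch below does this with simple Riemann sums for a Gaussian and for a continuous-time raised cosine. It assumes that $t_0$ and $\omega_0$ are the centroids of $|f(t)|^2$ and $|F(\omega)|^2$; the grid limits, step sizes, and function names are further assumptions chosen so that the example runs quickly.

```python
import cmath
import math


def uncertainty_product(f, t_max, w_max, dt, dw):
    """Evaluate 2.43 and 2.44 by Riemann sums and return delta_t * delta_omega."""
    ts = [-t_max + i * dt for i in range(int(round(2 * t_max / dt)) + 1)]
    ws = [-w_max + j * dw for j in range(int(round(2 * w_max / dw)) + 1)]
    fs = [f(t) for t in ts]

    # Temporal spread, equation 2.43 (the grid step cancels in each ratio).
    energy_t = sum(abs(v) ** 2 for v in fs)
    t0 = sum(t * abs(v) ** 2 for t, v in zip(ts, fs)) / energy_t
    var_t = sum((t - t0) ** 2 * abs(v) ** 2 for t, v in zip(ts, fs)) / energy_t

    # |F(omega)|^2 on the frequency grid, from a direct Fourier integral.
    power = []
    for w in ws:
        F = sum(v * cmath.exp(-1j * w * t) for t, v in zip(ts, fs)) * dt
        power.append(abs(F) ** 2)

    # Frequency spread, equation 2.44.
    energy_w = sum(power)
    w0 = sum(w * p for w, p in zip(ws, power)) / energy_w
    var_w = sum((w - w0) ** 2 * p for w, p in zip(ws, power)) / energy_w
    return math.sqrt(var_t * var_w)


def gaussian(t):
    return math.exp(-0.5 * t * t)


def raised_cosine(t):
    return 0.5 + 0.5 * math.cos(math.pi * t) if abs(t) < 1.0 else 0.0


gauss_product = uncertainty_product(gaussian, 8.0, 8.0, 0.02, 0.05)
hanning_product = uncertainty_product(raised_cosine, 1.0, 40.0, 0.005, 0.1)
```

Under these assumptions the Gaussian gives a product very close to 0.5, and the raised cosine gives a slightly larger value, so a smooth finite length window gives up little relative to the bound.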

When the desired process must change rapidly in both frequency
and time, probably the best solution is to match the frequency and
time resolution of the window to the average resolution of the
auditory system. In the experiments described here, the process
changes slowly compared to the temporal resolution of the ear. In
this case, the length of the window is based on the time that the
signal is approximately stationary, in order to best use the
time/frequency resolution available.

If the window is much longer than the stationary time of the
signal, the temporal extent of events is not well resolved and the
short-time spectrum begins to represent an average of sequential
events. If the window is much shorter, the frequency resolution of
the short-time spectrum is unnecessarily reduced. For signals such
as speech and music, the stationary time varies and the shape of the
window must be chosen to represent spectral features of interest
well over a range of stationary times. The usual assumption for
speech is that the signal is stationary for about 20 msec. This
corresponds approximately to the length of the window used for
Figures 5(a) and 5(b).

The minimum uncertainty function which satisfies the equality
in 2.45 is the gaussian. Landau and Pollak [17] have solved the
uncertainty problem under conditions more appropriate to the
discrete short-time Fourier transform. They show that the function
which vanishes for |t| > T/2 and has as much of its energy as
possible in the interval |ω| < Ω is the zeroth prolate spheroidal
wave function, ψ_0(t, ΩT/2). Many other finite length window
functions for spectral analysis have been investigated [18,19].
Each solves the problems of spectral resolution and computational
efficiency in a slightly different way. Most of these functions are
smooth.
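The bound in 2.45 can be checked numerically for particular functions. The sketch below is illustrative only and rests on two assumptions that are not part of the text: the function is real (so the spectral centroid is zero), and the frequency spread is computed in the time domain through Parseval's theorem, which equates the second moment of |F(ω)|² with the energy of the derivative of f(t). The integration limits and step counts are arbitrary.

```python
import math

def uncertainty_product(f, t_min, t_max, steps=20000):
    """Estimate (delta t)(delta omega) of 2.43-2.45 for a real function f(t)."""
    dt = (t_max - t_min) / steps
    ts = [t_min + (i + 0.5) * dt for i in range(steps)]  # midpoint rule
    energy = sum(f(t) ** 2 for t in ts) * dt
    t0 = sum(t * f(t) ** 2 for t in ts) * dt / energy
    var_t = sum((t - t0) ** 2 * f(t) ** 2 for t in ts) * dt / energy
    # Assumption: f is real, so |F|^2 is even and omega_0 = 0. By Parseval's
    # theorem the second moment of |F|^2 equals the energy of df/dt.
    h = dt / 2.0
    deriv_energy = sum(((f(t + h) - f(t - h)) / (2.0 * h)) ** 2 for t in ts) * dt
    var_w = deriv_energy / energy
    return math.sqrt(var_t * var_w)

def gaussian(t):
    return math.exp(-t * t / 2.0)

def hanning(t, T=1.0):
    """Continuous raised cosine that vanishes for |t| >= T."""
    return 0.5 + 0.5 * math.cos(math.pi * t / T) if abs(t) < T else 0.0

gauss_product = uncertainty_product(gaussian, -10.0, 10.0)
hann_product = uncertainty_product(hanning, -1.0, 1.0)
print(f"gaussian: {gauss_product:.4f}, hanning: {hann_product:.4f}")
```

Under these assumptions the gaussian sits on the bound, while the raised cosine lies only slightly above it.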

A hanning window was used for all the experiments described in
this paper. This window was easy to implement and provided adequate
time and frequency resolution. It is likely that some improvement
could be made in each experiment by optimizing the window shape for
the needs of the particular process. However, informal tests using
other window shapes indicate that, so long as the window is smooth,
the experimental results will not change significantly.

Equation 2.12 places no restrictions on window length, so long
as the window has appropriate zeros. For processes which modify the
short-time spectrum, the window length cannot be greater than the
number of frequency samples in S(k,n), as was shown in the last
section.

2.6 Information in the Magnitude and Phase
of the Short-Time Fourier Transform

The  remaining  chapters  describe  processes  which  alter  the 
magnitude of S(k,n) but leave the phase unchanged. An experiment
was  performed  to  determine  how  information  in  speech  is  divided 
between  the  magnitude  and  phase,  and  indicate  limitations  inherent 
in  processing  the  magnitude  only. 

The short-time transform of a typical sample of speech was
calculated using a 25.6 msec hanning window. Two new signals were
then synthesized from this data using 2.8. The first used the
original phase and flat magnitude. The second used the original
magnitude and a random phase function. The resulting signals are
shown in Figure 6. The magnitudes of their short-time transforms are
shown in Figure 7.

The flat magnitude, original phase signal contains mostly
excitation information. One can tell whether the speech is voiced
or unvoiced and determine the pitch, but the speech is otherwise
unintelligible. Pitch pulses can be seen in the waveform of Figure
6(b), and harmonic structure is evident in the spectrogram of Figure
7(b). The inability of magnitude flattening to eliminate pitch
structure is to be expected because of the frequency smoothing
discussed in the last section. The intended modification,
|S(k,n)|^0, is smoothed by the Fourier transform of the window as
in 2.30, and fine detail - primarily harmonic structure due to pitch
- is averaged out. It is apparent from Figure 7(b) that smoothing
has also occurred in the time direction. This is due to the finite
time resolution of the window used to calculate the original
short-time transform.

The original magnitude, random phase signal contains mostly
vocal tract information. The speech is quite intelligible, but
sounds whispered. This is evident from the noise-like structure of
the waveform in Figure 6(c), and the lack of harmonic structure in
Figure 7(c).

These results are consistent with comments by Flanagan [6], who
notes that in a narrow band implementation of the Phase Vocoder the
phase signal is primarily excitation and the magnitude signal is
primarily vocal tract information.
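The two modified spectra used in this experiment can be sketched for a single analysis frame. This is a minimal illustration, not the program used for the experiment: the test frame, its length, and the helper names are hypothetical, a direct DFT stands in for one time slice of the short-time transform, and resynthesis through 2.8 is not shown. The random phases here are drawn independently per bin, without enforcing the conjugate symmetry that a real output signal would require.

```python
import cmath
import math
import random

def dft(frame):
    """Direct DFT of one windowed frame (one time slice of S(k,n))."""
    N = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * n / N) for n, x in enumerate(frame))
            for k in range(N)]

def flat_magnitude(spectrum):
    """Unit magnitude, original phase."""
    return [cmath.rect(1.0, cmath.phase(c)) for c in spectrum]

def random_phase(spectrum, rng):
    """Original magnitude, phase drawn uniformly from [-pi, pi]."""
    return [cmath.rect(abs(c), rng.uniform(-math.pi, math.pi)) for c in spectrum]

N = 64  # hypothetical frame length in samples
frame = [(0.5 - 0.5 * math.cos(2 * math.pi * n / N)) *          # hanning taper
         (math.sin(2 * math.pi * 5 * n / N) + 0.3 * math.sin(2 * math.pi * 11.5 * n / N))
         for n in range(N)]
spectrum = dft(frame)
flat = flat_magnitude(spectrum)
whispered = random_phase(spectrum, random.Random(0))
```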


FIGURE 5

Effect of the window on the short-time Fourier
transform: (a) 25.6 msec Fourier window, (b) 25.6 msec
hanning window, (c) 12.8 msec hanning window. Each
spectrogram represents one second of speech, 0-5,000 Hz.
All three have been scaled by 6 dB/oct above 400 Hz. The
speech is "man's primary."

FIGURE  6 


Signals  reconstructed  from  the  magnitude  and  phase  of 
the short-time Fourier transform: (a) original speech, (b)
flat  magnitude,  original  phase  reconstruction,  (c)  original 
magnitude,  random  phase  reconstruction.  The  waveforms  are 
one  second  long. 


FIGURE 7

Spectrograms of signals reconstructed from the
magnitude and phase of the short-time Fourier transform: (a)
original speech, (b) flat magnitude, original phase
reconstruction, (c) original magnitude, random phase
reconstruction. The spectrograms in (a) and (c) have been
scaled 6 dB/oct above 400 Hz.


CHAPTER  3 


REMOVAL OF BROAD BAND BACKGROUND NOISE


3.1  Local  Wiener  Filtering 

The separation of a signal from additive random noise is a
common problem in acoustic signal processing. The optimum linear
system for performing such a separation, based on a mean square
error criterion, was obtained by Wiener [20]. Given the signal plus
noise, the signal is estimated with a linear combination of the
data*.


$$ x(t) = s(t) + n(t) \tag{3.1} $$

$$ \mathrm{est}\{s(t)\} = \sum_{r=-\infty}^{\infty} h(t,r)\,x(r) \tag{3.2} $$


If the signal and the noise are stationary, h(t,r) depends only on
the time difference τ = t - r.


$$ \mathrm{est}\{s(t)\} = \sum_{r=-\infty}^{\infty} h(t-r)\,x(r) \tag{3.3} $$


The  function  h(r)  is  chosen  to  minimize  the  mean  square  error 


$$ E\Big\{\Big|s(t) - \sum_{r=-\infty}^{\infty} x(t-r)\,h(r)\Big|^2\Big\} = \text{minimum} \tag{3.4} $$


where  E is  the  expected  value  operator.  Setting  the  variation  of 
3.4  to  zero 


*The discrete time index t is used in lieu of n in this chapter to
avoid confusion with the noise signal.



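$$ \delta\Big[E\Big\{\Big|s(t) - \sum_{r=-\infty}^{\infty} x(t-r)\,h(r)\Big|^2\Big\}\Big] = 0 \tag{3.5} $$

$$ E\Big\{\Big[s(t) - \sum_{r=-\infty}^{\infty} x(t-r)\,h(r)\Big]\,x(t-l)\Big\}\,\delta h(l) = 0 \tag{3.6} $$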


we  get,  since  the  variation  is  arbitrary 

$$ R_{sx}(l) - \sum_{r=-\infty}^{\infty} R_{xx}(l-r)\,h(r) = 0. \tag{3.7} $$

R_sx(l) is the cross correlation E{s(t)·x(t-l)}, and R_xx(l) is the
autocorrelation E{x(t)·x(t-l)}. Taking the Fourier transform of 3.7

$$ H(\omega) = \Phi_{sx}(\omega)/\Phi_{xx}(\omega). \tag{3.8} $$

Φ_sx(ω) is the cross spectrum of s(t) and x(t), defined as the
Fourier transform of R_sx, and Φ_xx(ω) is the spectrum of x(t). If
the signal and noise are uncorrelated, R_sx = R_ss and
R_xx = R_ss + R_nn, so 3.8 becomes

$$ H(\omega) = \frac{\Phi_{ss}(\omega)}{\Phi_{ss}(\omega) + \Phi_{nn}(\omega)} \tag{3.9} $$

Equation 3.9 assumes that the signal is stationary, and does not
apply to signals such as speech and music. However, we might
envision a running determination of 3.9, based on a local section of
the signal which is approximately stationary.

$$ H(\omega,t) = \frac{\Phi_{ss}(\omega,t)}{\Phi_{ss}(\omega,t) + \Phi_{nn}(\omega)} \tag{3.10} $$

A filtering  process  based  on  3.10  can  be  implemented  conveniently 
using the short-time Fourier transform. The local spectrum of the
signal  plus  noise  can  be  estimated  by  time  averaging  the  squared 
magnitude  of  the  short-time  transform 


$$ \Phi_{xx}(k,t) = \frac{1}{2L+1}\sum_{l=-L}^{L} |X(k,t-l)|^2 \tag{3.11} $$


where L is determined by the approximate stationary time of the
signal. The noise spectrum can be estimated in the same manner
during a period when there is no signal. The noise spectrum
estimate will normally have smaller variance than the local spectrum
estimate because the noise is stationary and the estimating time can
be quite large. Using these estimates and assuming that the signal
and the noise are uncorrelated, a local Wiener filter is obtained.


$$ H(k,t) = \frac{\Phi_{xx}(k,t) - \Phi_{nn}(k)}{\Phi_{xx}(k,t)}, \qquad \Phi_{xx} > \Phi_{nn} \tag{3.12a} $$

$$ H(k,t) = 0, \qquad \Phi_{xx} \le \Phi_{nn} \tag{3.12b} $$


The restriction to positive values is required because Φ_xx and Φ_nn
are estimated spectra with finite variance.
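A minimal sketch of the local Wiener gain follows, assuming the squared magnitudes of the short-time transform are already available as a list of frames. The function names, the data layout, and the shortened averaging span at the ends of the record are assumptions made for illustration; they are not details taken from the experiment.

```python
def local_spectrum(power_frames, t, L):
    """Average |X(k,t-l)|^2 over l = -L..L for every k (the running
    average of 3.11). power_frames[t][k] holds |X(k,t)|^2. Near the
    ends of the record the average is taken over the frames that exist."""
    lo, hi = max(0, t - L), min(len(power_frames) - 1, t + L)
    count = hi - lo + 1
    bins = len(power_frames[0])
    return [sum(power_frames[j][k] for j in range(lo, hi + 1)) / count
            for k in range(bins)]

def wiener_gain(phi_xx, phi_nn):
    """Gain of 3.12a where the local spectrum exceeds the noise spectrum,
    and zero (3.12b) elsewhere, applied bin by bin."""
    return [(xx - nn) / xx if xx > nn else 0.0 for xx, nn in zip(phi_xx, phi_nn)]

# Toy example: two frequency bins, three frames, hypothetical noise estimate.
frames = [[4.0, 1.0], [4.0, 1.0], [4.0, 1.0]]
noise = [1.0, 2.0]
gain = wiener_gain(local_spectrum(frames, 1, 1), noise)
print(gain)
```

The first bin, where the signal plus noise is four times the noise estimate, is passed with gain 0.75; the second bin falls below the noise estimate and is set to zero.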

This process was used to reduce the surface noise on a 1907
recording by Enrico Caruso. The spectrum of the surface noise was
estimated from silent grooves at the beginning and end of the
record. Observation of portions of the short-time spectrum which
are dominated by noise showed that the noise is stationary
throughout the recording. Figure 8 shows the estimated noise
spectrum. The average spectrum of the entire recording is also
shown for comparison.

The local spectrum of the noisy signal was estimated as in
3.11. A 25.6 msec hanning window was used for the short-time
Fourier transform. The averaging time (2L+1) was 100 msec. This
time is somewhat longer than the stationary time suggested by the
singing, but was the minimum necessary to obtain a local spectrum
estimate with acceptable variance. The short-time transform was
multiplied by H(k,t) from 3.12, and a new signal constructed from
2.8.

$$ X'(k,t) = H(k,t)\,X(k,t) \tag{3.13} $$

$$ x'(t) = \sum_k X'(k,t)\,\exp(i\theta_k t) \tag{3.14} $$

Spectrograms of a portion of the original and processed signal
are shown in Figure 9. The rapidly varying component of the
spectrograms is the singing; the relatively constant harmonic
structure below ~1500 Hz is the orchestra.

As is apparent in Figure 9(b), the process greatly reduces the
surface noise. Careful listening shows no change in the singing
voice. The variance of the local spectrum estimate, however, allows
narrow bands of noise to be audible, even though they have been
greatly attenuated. This effect is quite noticeable in passages
where the signal is quiet or relatively narrow band, so that the
remaining background noise is not masked. It is more noticeable
than white noise at the same level because the noise spectrum is
narrow and changes quite rapidly.

3.2  Thresholding 

Figure 8 shows that the overall signal to noise ratio for the
original recording is about 24 dB. The local signal to noise ratio
in |X(k,t)|² is often better, because the signal is concentrated in
frequency and time, while the noise is relatively smooth. The
regions of |X(k,t)|² dominated by the signal can therefore be
separated from those dominated by noise by a thresholding operation.


The thresholding algorithm was designed to ensure that
background noise was attenuated uniformly and to avoid artifacts in
the singing such as sudden starts and stops. Two reference levels
were used - a lower level 3 dB above the noise spectrum and an upper
level 12 dB above noise. When |X(k,t)|² exceeded the upper
threshold, it was left unattenuated (both in the time direction and
the frequency direction) until it passed below the lower threshold;
otherwise, |X(k,t)|² was attenuated by 24 dB. The effect can be
visualized in terms of a contour map of |X(k,t)|². If a peak in
|X(k,t)|² is higher than the contour corresponding to the upper
threshold, all portions of the peak down to the contour
corresponding to the lower threshold are left unchanged. Other
regions of the contour map are attenuated. This description must be
modified in one respect: to keep computer memory requirements small,
the algorithm could look backward in time only 50 msec.
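The contour map description can be sketched as a two-threshold region growing procedure. This is one possible reading, not the original implementation: every cell above the upper threshold is a seed, and the kept region grows to neighboring cells (in time or frequency) that remain above the lower threshold. The function name and data layout are hypothetical, and the 50 msec limit on looking backward in time is deliberately not modeled.

```python
from collections import deque

def threshold_gains(power, noise, lower_db=3.0, upper_db=12.0, atten_db=24.0):
    """power[t][k] = |X(k,t)|^2, noise[k] = estimated noise spectrum.
    Returns the power gain applied to each cell: 1.0 where the cell is
    kept, 10**(-atten_db/10) elsewhere. (Assumption: no look-back limit.)"""
    frames, bins = len(power), len(power[0])

    def above(t, k, db):
        return power[t][k] > noise[k] * 10.0 ** (db / 10.0)

    keep = [[False] * bins for _ in range(frames)]
    queue = deque()
    for t in range(frames):
        for k in range(bins):
            if above(t, k, upper_db):          # seeds: above the upper contour
                keep[t][k] = True
                queue.append((t, k))
    while queue:                               # grow down to the lower contour
        t, k = queue.popleft()
        for dt, dk in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            u, j = t + dt, k + dk
            if 0 <= u < frames and 0 <= j < bins and not keep[u][j] \
                    and above(u, j, lower_db):
                keep[u][j] = True
                queue.append((u, j))
    atten = 10.0 ** (-atten_db / 10.0)
    return [[1.0 if keep[t][k] else atten for k in range(bins)]
            for t in range(frames)]

# One frequency bin, four frames; noise power 1.0.
gains = threshold_gains([[20.0], [3.0], [1.5], [3.0]], [1.0])
```

In the toy example the second frame is kept because it is connected to a peak above the upper threshold, while the last frame, at the same level but isolated from any peak, is attenuated. The gains apply to |X(k,t)|²; the corresponding amplitude gain is the square root.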

A spectrogram of the resulting signal is shown in Figure 9(c).
The  thresholding  algorithm  performed  better  than  the  local  Wiener 
filter  in  that  background  noise  was  completely  eliminated.  Again 
there  was  no  noticeable  change  in  the  singing  voice.  The  only 
apparent  degradation  was  in  the  sound  of  the  accompanying  orchestra, 
which  was  somewhat  thin  in  places  because  orchestral  harmonics 
remain  too  near  the  level  of  the  noise,  and  are  attenuated. 


FIGURE 8

[Caption partly illegible in the source] ... Caruso recording:
(a) 1024 point ..., (b) 128 point estimate actually ...,
(c) 1024 point estimate of average ..., (d) 128 point average
spectrum ...


FIGURE  9 

Background noise removal: (a) original recording, (b)
processed  by  local  Wiener  filtering,  (c)  processed  by 
thresholding.  All  three  spectrograms  have  been  scaled 
6 dB/oct  above  400  Hz. 


CHAPTER 4


ISOLATION  OF  PERCEPTUALLY  IMPORTANT  SPEECH  FEATURES 

There is considerable evidence that certain features of the
short-time spectrum are important to perception. Flanagan [2]
describes examples of the importance of pitch changes, formant
frequencies and bandwidths, and voiced-unvoiced changes. All of
these are evident in the short-time spectrum. Liberman's work [5]
in synthesizing speech from spectrograms is also illustrative:
he shows, for example, that in plosive consonants such as
p, t, and k, the direction of the formant transition following the
noise burst often determines which consonant is perceived.

Experiments were conducted to attempt to isolate selected
speech features in log|S(k,n)|, and determine which regions were
important to perception. Three features were selected as
typical - pitch, formants, and plosive noise bursts. The logarithm
was taken to model the approximate logarithmic sensitivity of the
auditory system. In addition, it was desired to isolate these
features by two dimensional filtering. Bandpass filtering of the
magnitude or squared magnitude of S(k,n) produces functions which
are not everywhere positive, and this limits its application to
speech processing.

A 25.6 msec hanning window was used for calculation of the
short-time Fourier transform. The window was augmented with zeros
to twice this length to increase frequency direction sampling of
S(k,n) and minimize aliasing in log|S(k,n)|. The normal time
sampling of S(k,n) for this window (sampling period 3.2 msec) was
sufficiently conservative to limit time direction aliasing of
log|S(k,n)|. The two dimensional filters used were zero phase with
flat passbands. Half cosine transition regions were used to
minimize ringing effects.
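One way to picture such a filter is as a real, even gain function with a flat stopband, a flat passband, and half a period of a cosine joining them. The sketch below is an assumption about the transition shape (the text does not give a formula), and the separable product used for the two dimensional case is likewise only illustrative; all names and band edges are hypothetical.

```python
import math

def highpass_gain(f, f_stop, f_pass):
    """Zero phase gain: 0 below f_stop, 1 above f_pass, and a half
    cosine transition in between (assumed shape)."""
    f = abs(f)  # a real, even gain gives zero phase
    if f <= f_stop:
        return 0.0
    if f >= f_pass:
        return 1.0
    x = (f - f_stop) / (f_pass - f_stop)
    return 0.5 - 0.5 * math.cos(math.pi * x)

def bandpass_gain(f, low_stop, low_pass, high_pass, high_stop):
    """Flat passband between low_pass and high_pass with half cosine edges."""
    return highpass_gain(f, low_stop, low_pass) * \
        (1.0 - highpass_gain(f, high_pass, high_stop))

def gain_2d(f_n, f_k, n_band, k_band):
    """Separable two dimensional gain (an assumption): a product of a
    bandpass in the time direction and one in the frequency direction."""
    return bandpass_gain(f_n, *n_band) * bandpass_gain(f_k, *k_band)
```

For example, `highpass_gain(15.0, 10.0, 20.0)` lies halfway up the transition, and `gain_2d` is unity only where both one dimensional gains are in their passbands.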

The  results  of  these  experiments  are  summarized  in  Figure  10, 
which  shows  the  approximate  regions  of  the  Fourier  transform  of 
log|S(k,n)| occupied by pitch information, formant information, and
plosive  information.  The  distinction  between  steady  state  formants 
and  formant  transitions  in  Figure  10  is  somewhat  arbitrary.  This 
reflects  the  fact  that  there  are  few  sections  of  speech  which  are 
clearly  steady  state.  One  should  keep  in  mind  that  only  the  low 
frequency  component  of  the  speech  features  represented  in  Figure  10 
is  accessible  to  processing  based  on  the  magnitude  only,  because  of 
the  smoothing  effect  discussed  in  Chapter  2. 

The regions shown in Figure 10 generally represent male speech.
For  female  speech,  the  pitch  period  sometimes  decreases  below  3.0 
msec  so  the  lower  boundary  for  pitch  features  extends  into  the 
formant  region. 

Figures 11 and 12 show speech features obtained by two
dimensional filtering of log|S(k,n)|. These features were obtained
by multiplying the Fourier transform of log|S(k,n)| by a filter
corresponding to the appropriate region in Figure 10, inverse
transforming, and exponentiating. Each of the pictures has been
scaled to use the full dynamic range of the film, so the figures do
not indicate the amplitude of the features relative to the original
speech. The pitch, formants, and plosives have much lower dynamic
range than the original speech. Most of the dynamic range of the
original is in the low frequency area of Figure 10. This is shown
in Figure 12(c), which shows the features obtained by high pass
filtering with zero low frequency gain and transitions at
f_k = .5 msec, f_n = 25 Hz.

An interesting feature of these results is that the slowly
changing component of log|S(k,n)| seems to be of lesser importance
to perception. This is supported by characteristics of the auditory
system. The signal in the auditory nerve exhibits "temporal
adaptation" [10,12] - that is, the auditory system adapts quite
rapidly (a few tens of milliseconds) to changes in stimulus
intensity. This removes the slowly changing temporal component and
reduces the dynamic range of the neural signal. The auditory system
also exhibits "two-tone inhibition" [9,21] - the threshold for
hearing a test tone is increased by nearby tones. This may
attenuate the component of the signal in the auditory system which
changes slowly with frequency, similar to lateral inhibition in the
visual system [22]. It is interesting to note that effects which
attenuate the slowly changing component in a signal can be modeled
mathematically by high pass filtering the logarithm of the signal
magnitude [23], as was done in Figure 12(c).

This  discussion  suggests  that  a signal  synthesized  from  Figure 
12(c)  should  have  low  dynamic  range  but  still  be  highly 
intelligible,  and  this  is  in  fact  the  case.  Speech  processed  to 
attenuate the low frequency component of the short-time spectrum
might therefore be more intelligible than normal speech in a noisy
environment.  Informal  listening  tests  and  the  results  of  the  next 
chapter  support  this  notion. 


[Figure 10: plot; recovered axis label: f_k (msec)]

FIGURE 10

Areas of the Fourier transform of log|S(k,n)| occupied by
speech features: F represents formants, P pitch, and N plosive noise
bursts. The dotted line represents an approximate division between
steady state formants and formant transitions.

FIGURE 11

Speech features obtained by two dimensional filtering
of log|S(k,n)|: (a) original speech, (b) pitch, (c)
formants. Each picture represents 1.6 seconds of speech,
and has been scaled to use the entire dynamic range of the
film. The speech is "the pipe began to rust." The
spectrogram in (a) has been scaled 6 dB/oct above 400 Hz.

FIGURE 12

Speech features obtained by two dimensional filtering
of log|S(k,n)|: (a) formants, (b) plosives, (c) slowly
varying component removed. Figure (a) was obtained from
Figure 11(c) by clipping below 10% of the maximum value to
remove small background features.


CHAPTER  5 


TWO DIMENSIONAL COMPRESSION AND EXPANSION

5.1 Review of Homomorphic Compression and Expansion

Oppenheim et al [23] investigated a homomorphic system for
compression and expansion of acoustic signals. The system is based
on modeling audio signals as a product of two components - a slowly
varying, positive envelope signal e(n) and a rapidly varying bipolar
signal v(n).

$$ s(n) = e(n)\,v(n) \tag{5.1} $$

These  multiplied  signals  can  be  mapped  into  added  signals  by  the 
logarithm. 

$$ \log[s(n)] = \log|e(n)| + \log|v(n)| + i\angle v(n) \tag{5.2} $$

The imaginary part ∠v(n) is either 0 or π, and represents the sign
of s(n). If the frequency bands occupied by log|e(n)| and log|v(n)|
do not overlap, these signals can be separated by linear filtering.
Although this will not be the case for intuitive definitions of e(n)
and v(n), Reference 23 provides evidence that the component of
log|s(n)| below about 16 Hz should be treated as log|e(n)|.

A multiplicative  filter  to  compress  or  expand  s(n)  can  be 
obtained  using  this  model.  The  logarithm  is  computed  as  in  5.2,  and 
the real part is filtered so that log|e(n)| is multiplied by a
constant p, while log|v(n)| is passed unchanged. The resulting
signal

$$ s'(n) = e^{p}(n)\,v(n) \tag{5.3} $$

is  compressed  or  expanded  as  p is  less  or  greater  than  unity.  A 
block  diagram  of  this  system  is  shown  in  Figure  13.  H(f)  is  a 
filter  with  gain  p below  16  Hz  and  unity  gain  at  higher  frequencies. 
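The multiplicative filter can be sketched in a few lines. This is an illustration under stated assumptions, not the system of Reference 23: a centered moving average stands in for the low pass part of H(f), the half-width is a free parameter, and a small floor is applied before the logarithm to avoid log of zero. All names are hypothetical.

```python
import math

def homomorphic_compress(s, p, half_width, floor=1e-6):
    """Scale the slowly varying part of log|s(n)| by p and pass the rest.
    Assumptions: a moving average estimates log|e(n)|; |s(n)| is floored
    before the logarithm; zero samples are passed as zero."""
    log_mag = [math.log(max(abs(x), floor)) for x in s]
    out = []
    for n, x in enumerate(s):
        if x == 0.0:
            out.append(0.0)
            continue
        lo, hi = max(0, n - half_width), min(len(s), n + half_width + 1)
        log_env = sum(log_mag[lo:hi]) / (hi - lo)   # estimate of log|e(n)|
        log_fine = log_mag[n] - log_env             # treated as log|v(n)|
        sign = -1.0 if x < 0 else 1.0               # the imaginary part in 5.2
        out.append(sign * math.exp(p * log_env + log_fine))
    return out

# A constant-envelope bipolar signal of amplitude 16, compressed with p = 0.25.
signal = [16.0 * (-1) ** n for n in range(50)]
compressed = homomorphic_compress(signal, 0.25, 5)
```

With a constant envelope of 16 and p = 0.25 the output amplitude is 16^0.25 = 2, and the signs of the samples are preserved. Expansion with 1/p applied to the same envelope estimate would invert the operation.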

Figure  14  is  a compression-expansion  system  for  transmission 
through  a noisy  channel.  The  blocks  labelled  "compress"  and 
"expand"  are  homomorphic  systems  like  that  shown  in  Figure  13.  The 
linear  filter  used  for  expansion  is  the  inverse  of  the  filter  used 
for  compression.  The  block  "T"  represents  the  effect  of  channel 
noise; i.e., audio tape hiss or quantization noise.

The  final  output  of  the  system,  s'"(n),  is  precisely  equal  to 
the  input  s(n)  only  in  the  case  where  T is  the  identity  system. 
Small  amounts  of  additive  noise,  however,  do  not  significantly 
distort  the  output.  Compression  and  expansion  of  the  signal  makes 
the  signal  to  noise  ratio  more  uniform  over  time,  and  improves 
performance  over  that  obtained  with  transmission  of  the  uncompressed 
signal.

5.2  Two  Dimensional  Compression  and  Expansion 

In situations where perception of the output signal is
important, the performance of the system of Figure 14 may not always
be optimum. In speech and music, there are often occasions when the
overall signal level is high, but most of the signal energy is
concentrated in a relatively small band of frequencies. In such
instances noise at other frequencies may be audible since it is not
masked by the input signal.

A compression-expansion  system  which  operates  in  the 
time/frequency  domain  would  alleviate  this  problem  by  compressing 
frequency  bands  more  or  less  independently.  Such  a system  would 
have  the  effect  of  pre-whitening  the  input  signal,  which  is 
desirable on a more theoretical basis [18,24].

A two  dimensional  compression-expansion  system  can  be  obtained 
following  the  one  dimensional  homomorphic  analysis.  The  short-time 
Fourier  transform  is  modeled  as  a product 

$$ S(k,n) = E(k,n)\,V(k,n) \tag{5.4} $$

where E(k,n) is slowly changing and positive and V(k,n) is rapidly
changing and complex. This model has successfully been used as the
basis of several speech analysis-synthesis systems (vocoders), and
has some physical basis [7]. It leads to the two dimensional
multiplicative filter shown in Figure 15, which compresses S(k,n)
when H(f_n,f_k) has low frequency gain less than one. A
compression-expansion system for the time waveform s(n) can now be
constructed by adding the necessary transforms between the time
signal and S(k,n). The resulting two dimensional
compression-expansion system is shown in Figure 16. (In Figure 16,
STFT represents the short-time Fourier transform, 2.7, and Σ_k
represents the reconstruction equation, 2.8.)

The system of Figure 16 is similar to that of Figure 14, but a
new theoretical problem exists. Figure 16 is not the identity
system, even in the case of a noiseless channel. Synthesis of a
time signal from S'(k,n) followed by reanalysis to produce S"(k,n)
is not in general an identity operation. However, as discussed in
Chapter 2, S'(k,n) will be approximately equal to S"(k,n) so long
as changes to S(k,n) vary slowly in time (compared to the window),
and in frequency (compared to the Fourier transform of the window).
This means that the compression and expansion filters, H(f_n,f_k) and
H^-1(f_n,f_k), can differ from unity only in the low frequency region.
This is consistent with the aim of compression and expansion, and
accurate reproduction of log|S'(k,n)| proved not to be a problem in
this experiment. The dynamic range of S(k,n) could be compressed 4:1
and reexpanded with an accuracy of 0.3% or better. The worst errors
occurred during rapid transients, as would be expected. This
accuracy is comparable to that obtained for 4:1 compression and
expansion with a homomorphic system like that in Figure 14.

5.3 Comparison of Two Dimensional and Homomorphic Compression

The two dimensional system of Figure 16 will not in general
lead to the same compressed signal envelope as the homomorphic
system of Figure 14. The reason can be understood with the help of
a simplified example. Consider a signal which initially has a flat
spectrum (e.g., white noise), and assume that at some later time the
high frequency half of the signal spectrum has increased by a factor
b. We compute the ratio of the final power of the compressed signal
to its initial power, and compare the ratio for the two systems.

Assuming that the initial and final signals are locally
stationary, the power can be estimated from the square of the
signal. In the homomorphic case:


$$ P_j \approx \sum_{l=-L}^{L} |e^{p}(j)\,v(j+l)|^2 = e^{2p}(j)\sum_{l=-L}^{L} |v(j+l)|^2, \tag{5.5} $$

$$ R_1 = P_f/P_i = (e_f^2/e_i^2)^p, \tag{5.6} $$

$$ R_1 = (1/2 + b/2)^p. \tag{5.7} $$


Equation 5.7 was obtained using Parseval's theorem and the known
signal spectrum. In the two dimensional case:


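$$ P_j \approx \sum_{l=-L}^{L} |s'(j+l)|^2, \tag{5.8} $$

$$ \approx \sum_{l=-L}^{L} \Big|\sum_k E^{p}(k,j)\,V(k,j+l)\,\exp(i\theta_k (j+l))\Big|^2, \tag{5.9} $$

$$ \approx \sum_{l=-L}^{L} \sum_k |E^{p}(k,j)\,V(k,j+l)|^2, \tag{5.10} $$

$$ P_j \approx \sum_k |E^{p}(k,j)|^2 \sum_{l=-L}^{L} |V(k,j+l)|^2. \tag{5.11} $$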

The absolute square can be taken inside the sum in 5.10 because the
terms E^p(k,j)·V(k,j)·exp(iθ_k j) are approximately nonoverlapping band pass
signals, and therefore the cross products in 5.9 sum to zero when
time averaged. Assuming that the time average behavior of V(k,n) is
the same for all k, we obtain the desired ratio.


$$ R_2 = P_f/P_i = \Big(\sum_k |E^{p}(k,f)|^2\Big)\Big/\Big(\sum_k |E^{p}(k,i)|^2\Big) \tag{5.12} $$

$$ R_2 = 1/2 + b^{p}/2 \tag{5.13} $$


Figure 17 is a plot of the ratio R_1/R_2 for the example just
described, and for the case where 1/4 of the spectrum increases.
The curves were obtained with p = 0.25, but are not a sensitive
function of p. They show what might be expected on a physical
basis: the two dimensional system compresses the signal more when
the energy is concentrated in a narrow frequency band. This effect
was observed experimentally.
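The two ratios are simple enough to evaluate directly. The sketch below assumes that b is a power ratio and generalizes 5.7 and 5.13 to an arbitrary fraction of the spectrum; that generalization and the function names are assumptions made for illustration, and no claim is made that the numbers reproduce Figure 17.

```python
def homomorphic_ratio(b, p, fraction=0.5):
    """R_1 of 5.7, generalized (assumption): a `fraction` of the spectrum
    rises in power by the factor b, and the total power is raised to p."""
    return ((1.0 - fraction) + fraction * b) ** p

def two_dimensional_ratio(b, p, fraction=0.5):
    """R_2 of 5.13 with the same generalization: each band is compressed
    separately, so only the raised band contributes b**p."""
    return (1.0 - fraction) + fraction * b ** p

p = 0.25
for fraction in (0.5, 0.25):
    for b in (1.0, 4.0, 16.0, 64.0):
        r1 = homomorphic_ratio(b, p, fraction)
        r2 = two_dimensional_ratio(b, p, fraction)
        print(f"fraction={fraction}, b={b}: R1/R2 = {r1 / r2:.3f}")
```

For b greater than one and p less than one, R_2 is never larger than R_1 (x^p is concave), and the gap widens when the raised band is narrower.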


5.4  Experimental  Results 

Two experiments were performed with the two dimensional system
of Figure 16. The experiments were intended to simulate
transmission of speech through an analog and a digital channel. In
both cases, the results were judged by comparison with a similar
homomorphic system.

A 20.0 msec hanning window was used for the short-time Fourier
transform to provide maximum frequency resolution consistent with
time direction compression below about 16 Hz. The linear filter in
the compressor had a unit sample response equivalent to the
continuous function

$$ h(k,n) = \delta(k,n) - 0.75\,w(k,n) \tag{5.14} $$

$$ w(k,n) = A\,[.5 + .5\cos(\pi k/390)]\,[.5 + .5\cos(\pi n/.04)] \tag{5.15} $$

$$ -.04 < n < .04\,; \quad -390 < k < 390 $$

where n is in seconds and k is in Hz. A is a normalizing factor so
that the area under w(k,n) is unity. This resulted in a high pass
filter with transitions at about f_n = 13 Hz, f_k = 1.5 msec, and a
low frequency gain of 0.25. The impulse response was specified
rather than the frequency response because a smooth impulse response
was essential to accurate expansion. The filter in the homomorphic
system used for comparison was

$$ h(n) = \delta(n) - 0.75\,B\,[.5 + .5\cos(\pi n/.04)] \tag{5.16} $$

where B is again a normalizing factor.
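The low frequency gain of 0.25 follows directly from the normalization: the impulse contributes unity and the unit-area raised cosine removes 0.75 at zero frequency. A sampled version of the compressor response can be sketched as follows. The grid half-widths are hypothetical sample counts, not the 390 Hz and .04 sec extents of the continuous function, and the function names are assumptions.

```python
import math

def raised_cosine(x, half_width):
    """.5 + .5*cos(pi*x/half_width) inside the support, zero outside."""
    return 0.5 + 0.5 * math.cos(math.pi * x / half_width) if abs(x) < half_width else 0.0

def compressor_response(k_half, n_half, depth=0.75):
    """Sampled h(k,n) = delta(k,n) - depth*w(k,n), with w normalized to
    unit sum (the discrete counterpart of unit area)."""
    w = [[raised_cosine(k, k_half) * raised_cosine(n, n_half)
          for n in range(-n_half, n_half + 1)]
         for k in range(-k_half, k_half + 1)]
    area = sum(map(sum, w))
    h = [[-depth * value / area for value in row] for row in w]
    h[k_half][n_half] += 1.0   # the impulse at the origin
    return h

h = compressor_response(4, 6)          # hypothetical grid half-widths
dc_gain = sum(map(sum, h))             # gain at zero frequency in both directions
print(round(dc_gain, 6))
```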

An analog channel was simulated by adding gaussian distributed
white noise to the compressed speech, s'(n), at levels from -35 dB
to -12 dB. The level of the noise was referenced to the peak
compressed  signal  energy  determined  by  averaging  the  squared  signal 
with  a 50  msec  hanning  window.  The  peak  power  occurred  at  the  same 
place  (during  a vowel  sound)  for  both  homomorphic  and  two 
dimensional  compression,  and  the  peak  power  criterion  did  not  appear 
to  bias  the  results  in  favor  of  either  method.  Figures  18  and  19 
show  the  compressed  speech  s'(n),  compressed  speech  plus  noise 
s"(n),  and  the  expanded  output  speech  s'"(n)  for  the  homomorphic 
system  (Figure  14)  and  the  two  dimensional  system  (Figure  16). 
Figure  20  shows  a comparison  of  the  output  speech  for  the  two 
systems  and  an  example  where  no  compression  was  used.  Figure  21 
shows spectrograms of the output. (The noise above 4000 Hz in
Figure  21(b)  was  removed  by  the  anti-imaging  filter  on  playback.) 
All  four  figures  are  for  a channel  signal  to  noise  ratio  of  12  dB. 

The  two  dimensional  system  provided  18  dB  improvement  over  the 
homomorphic  system.  Noise  was  first  audible  in  the  output  of  the 
two  dimensional  system  at  a channel  signal  to  noise  ratio  of  12  dB, 
compared  to  30  dB  for  the  homomorphic  system.  Some  of  the 
improvement  can  be  attributed  to  the  removal  of  the  natural  high 
frequency  rolloff  of  speech  by  compression  of  S(k,n),  as  evidenced 
in  Figure  21.  However,  flattening  the  average  speech  spectrum  by 
preemphasis  at  6 dB/oct  above  400  Hz  prior  to  compression  and 
complementary  deemphasis  of  the  output  signal  improved  the 
performance  of  the  homomorphic  system  by  only  ~3  dB. 

A digital channel was simulated by quantizing the compressed
signal to a small number of bits (3-8). The quantizer is shown in
Figure 22. Since envelope distortion caused by clipping is
particularly noticeable in both systems, the compressed speech was
scaled so that the peak signal was equal to $2^{n-1}$, where n is the
number of bits. A dithering signal was added prior to quantization
and subtracted afterward to break up the correlation between the
quantization noise and the signal [25,26]. The dithering signal
used was uniformly distributed pseudorandom noise, with peak
magnitude equal to 1/2 quantization level and zero mean.
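
A minimal sketch of this dithered channel quantizer is given below. The
exact quantizer characteristic of Figure 22 is not recoverable from the
text, so a mid-rise uniform quantizer with unit step is assumed here;
the function name dithered_quantize is hypothetical.

```python
import numpy as np

def dithered_quantize(x, n_bits, rng=None):
    """Quantize x to n_bits with subtractive uniform dither.

    Returns (restored, scaled): the channel output after the dither is
    subtracted, and the scaled input it should be compared with.
    """
    rng = np.random.default_rng() if rng is None else rng
    full_scale = 2.0 ** (n_bits - 1)
    scaled = x * (full_scale / np.max(np.abs(x)))    # peak signal -> 2^(n-1)
    dither = rng.uniform(-0.5, 0.5, size=x.shape)    # zero mean, peak 1/2 step
    # Assumed mid-rise quantizer: levels at ..., -1.5, -0.5, 0.5, 1.5, ...
    q = np.floor(scaled + dither) + 0.5
    q = np.clip(q, -full_scale + 0.5, full_scale - 0.5)   # 2^n levels
    return q - dither, scaled
```

Away from the extreme levels the error of this sketch stays within half
a quantization step and no longer follows the signal waveform, which is
the purpose of the dither.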

The results of this experiment are consistent with the analog
channel simulation. The two dimensional system provided a three bit
improvement over the homomorphic system. Noise was first audible in
the output of the two dimensional system with four bit channel
quantization, compared to seven bit quantization for the homomorphic
system. For comparison, quantization noise was first audible in the
uncompressed signal at nine bit quantization. (Fifteen bit
quantization was used for the input and output speech.) The apparent
loss of 6 dB signal to noise ratio compared to the analog
experiment, based on the "6 dB per bit" rule of thumb, is due to the
difference in criteria for scaling the compressed signal.

The  waveforms  for  three  bit  quantization  of  the  channel  signal 
are  nearly  indistinguishable  from  Figures  18  - 21.  The  same  is  true 
of  listening  tests. 




FIGURE  13 

Homomorphic  filter  for  compression  or  expansion  of  acoustic 
signals. 



FIGURE  14 

Homomorphic  compression/expansion  system  for  transmitting 
acoustic  signals  through  a noisy  channel. 




FIGURE  15 

Two dimensional homomorphic filter for compression or expansion
of |S(k,n)|.


FIGURE  16 


Two  dimensional  compression/expansion  system  for  transmitting 
acoustic  signals  through  a noisy  channel. 




FIGURE  17 

Comparison of homomorphic and two dimensional compression. The
curves  show  the  ratio  of  two  dimensional  compression  to  homomorphic 
compression,  R*/Rt,  as  a function  of  the  change  in  a portion  of  the 
spectrum  of  the  input  signal,  b.  The  two  curves  represent  changes 
in  1/2  and  1/4  of  the  spectrum,  as  labelled. 


FIGURE 18


Compression and expansion of speech using a homomorphic
system: (a) original speech, (b) compressed speech, (c)
compressed speech with noise added, (d) expanded output
speech. Each picture represents one second of speech.
Channel signal to noise ratio is 12 dB.


FIGURE 20


Comparison of output speech for homomorphic and two
dimensional systems: (a) input speech, (b) no compression,
(c) homomorphic system, (d) two dimensional system. Each
picture represents one second of speech. Channel signal to
noise ratio is 12 dB.


FIGURE 21


Spectrograms of output speech for homomorphic and two
dimensional systems: (a) input speech, (b) homomorphic
system, (c) two dimensional system. Channel signal to noise
ratio is 12 dB. All three spectrograms have been scaled
6 dB/oct above 400 Hz.


FIGURE  22 

Quantizer used for digital channel simulation. Three bit
quantization is shown.


CHAPTER 6


REMOVAL OF LOCALLY PERIODIC INTERFERING SIGNALS

6.1 Removal by Spectrum Estimation

Cases arise where speech intelligibility is degraded by locally
periodic signals. Electrical interference in communications
channels is sometimes periodic with many harmonics. Aircraft
cockpit noise often has a large periodic component due to engine
noise. The precise fundamental frequency and harmonic structure of
these signals are typically hard to predict, and the signals change
from time to time. Electrical interference may be intermittent or
fade, and cockpit noise changes with engine speed. For times of the
order of seconds, however, these signals appear periodic.

The short-time spectrum of periodic signals is characterized by
sharply defined harmonic structure. An adaptive filter which
selectively attenuates such features can be obtained using a
technique similar to that developed by Stockham et al. [27] for
blind deconvolution, in which the actual spectrum of the noisy
signal is replaced with a prototype spectrum obtained from
undegraded speech. The advantage of this approach is that precise
knowledge of the frequency and harmonic structure of the interfering
signal is not required.

The local spectrum of the degraded speech can be estimated by
time averaging the short-time spectrum

$$\Phi(k,n) = \frac{1}{2L+1} \sum_{l=-L}^{L} |S(k,n+l)|^2 \qquad (6.1)$$

where the length of the time average is determined by the stationary
time of the interfering signal. The estimated spectrum in 6.1 will
have two components, one due to the interfering signal and another
due to the local average speech spectrum. If the averaging time is
long enough, the local average speech spectrum can be approximated
by a prototype computed from similar undegraded speech. Portions of
the local spectrum which are dominated by noise can then be restored
by multiplying the short-time Fourier transform by

$$H(k,n) = \left[ P(k) / \Phi(k,n) \right]^{1/2} \qquad (6.2)$$

where P(k) is the prototype spectrum.

Equation 6.2 is the square root of the Wiener filter which
would be obtained if we modeled speech as a stationary random
process, uncorrelated with the interfering signal. It might
therefore be expected that 6.2 will not adequately attenuate
frequencies dominated by the interference. This is also to be
expected because 6.2 only restores the magnitude of the short-time
transform, and portions of the transform which were dominated by
noise will still sound like noise due to the phase. However,
squaring 6.2 may not be advisable, because this will emphasize
distortions caused by the fact that P(k) is not precisely the local
average speech spectrum. Equation 6.2 is therefore replaced with

$$H(k,n) = \left[ P(k) / \Phi(k,n) \right]^{p} \qquad (6.3)$$

where p is between 0.5 and 1. It was found experimentally that for
averaging times of the order of a second p = 0.75 represents a good
compromise between noise reduction and distortion of the underlying
speech.
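
The gain of equation 6.3 can be sketched in a few lines. This is an
illustration only: the array layout (frequency index k along axis 0,
time index n along axis 1), the edge padding of the time average, the
small floor on the estimated spectrum, and the function names are all
assumptions made for the example.

```python
import numpy as np

def local_spectrum(S, L):
    """Time average of |S(k, n+l)|^2 over l = -L..L, as in equation 6.1."""
    power = np.abs(S) ** 2
    # Repeating the edge frames at both ends is an assumption of this sketch.
    padded = np.pad(power, ((0, 0), (L, L)), mode="edge")
    kernel = np.ones(2 * L + 1) / (2 * L + 1)
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="valid"), 1, padded)

def restoration_gain(S, prototype, L, p=0.75, floor=1e-12):
    """H(k, n) = [P(k) / Phi(k, n)]^p, with P(k) given as a 1-D array."""
    phi = np.maximum(local_spectrum(S, L), floor)   # avoid division by zero
    return (prototype[:, None] / phi) ** p
```

The restored transform would then be H(k,n) multiplied by S(k,n), so
that only the magnitude is changed and the phase of the degraded signal
is kept, as noted in the text.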

6.2  Removal  by  Two  Dimensional  Filtering 

The short-time spectrum can alternately be restored by two
dimensional filtering of $\log\{|S(k,n)|^2\}$. This approach does not
require a prototype speech spectrum, and so may be useful when the
stationary time of the interfering signal is relatively short.

This approach is based on the fact that low pass filtering of
$\log\{|S(k,n)|^2\}$ in the time direction is equivalent to estimating
$\log\{\Phi(k,n)\}$ by a weighted time average of $\log\{|S(k,n)|^2\}$.
Spectrum estimates based on time averages of the logarithm of the local
spectrum are biased toward stationary spectral components [27]. In
the present application, this bias is an advantage since it
emphasizes the spectrum of the interfering signal.

Time direction low pass filtering attenuates all points in the
(two dimensional) frequency domain except those which are on or near
the $f_k$ axis. If the interfering signal has appreciable harmonic
structure, features in the Fourier transform of $\log\{|S(k,n)|^2\}$ due
to the interfering signal lie mainly beyond $|f_k| = \tau$, where $\tau$
is the period of the interfering signal. Features for $|f_k| < \tau$ are
primarily due to the local speech spectrum. The spectrum of the
interfering signal can therefore be estimated with a filter $L(f_k,f_n)$
which isolates components near the $f_k$ axis for $|f_k| > \tau$. Such a
filter is shown in Figure 23. This filter can be implemented quite
efficiently, since its frequency response (and impulse response) are
separable.


The magnitude of the short-time spectrum can be restored by
filtering $\log\{|S(k,n)|^2\}$ with

$$H(f_k,f_n) = 1 - p \cdot L(f_k,f_n) \qquad (6.4)$$

where p is a number between 0.5 and 1 analogous to the power in 6.3.

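
The filter of equation 6.4 can be illustrated with a frequency domain
sketch. Several things here are assumptions rather than the method of
the original programs: the filter is applied with a two dimensional FFT
(circular filtering) instead of a separable convolution, L is an ideal
0/1 mask, the mask includes the point $|f_k| = \tau$ itself, and the
function and parameter names are hypothetical.

```python
import numpy as np

def remove_periodic_2d(log_power, bin_hz, frame_rate_hz, tau,
                       fn_cutoff_hz=0.5, p=0.75):
    """Filter log|S(k,n)|^2 (shape K x N) with H = 1 - p * L.

    bin_hz        spacing of the frequency samples k, in Hz
    frame_rate_hz number of time samples n per second
    tau           period of the interfering signal, in seconds
    """
    K, N = log_power.shape
    f_k = np.fft.fftfreq(K, d=bin_hz)               # conjugate to k (seconds)
    f_n = np.fft.fftfreq(N, d=1.0 / frame_rate_hz)  # conjugate to n (Hz)
    # Separable ideal mask: slow in time, and at or beyond tau along f_k.
    L = (np.abs(f_k)[:, None] >= tau) & (np.abs(f_n)[None, :] <= fn_cutoff_hz)
    H = 1.0 - p * L
    return np.fft.ifft2(np.fft.fft2(log_power) * H).real
```

A stationary ripple across frequency with period 1/tau in the log
spectrum is scaled by 1 - p, while the smooth spectral envelope and any
rapid time variation pass unchanged.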
6.3  Experimental  Results 

To evaluate the processes discussed above, a test signal was
constructed by adding an asymmetric square wave (positive signal 1.5
times as long as negative signal) to speech. The frequency of the
square wave was changed from time to time. The approximate signal
to noise ratio was -26 dB, and the speech was unintelligible.

The short-time Fourier transform of the degraded speech was
calculated with a 102.4 msec hanning window. A longer window (~200
msec) would have better resolved the interfering signal, but the
length of the window was limited by computer memory requirements in
the implementation of the short-time transform used here. $|S(k,n)|$
was modified using 6.3 with p = 0.75, and an averaging time of one
second for $\Phi(k,n)$. The prototype was obtained from a different
speaker.

The time waveform and a spectrogram of the degraded and
restored signals are shown in Figure 25. The interfering signal is
greatly attenuated, except near places where the frequency of the
interfering signal changes. The resulting speech is intelligible,
and is sufficiently natural that the speaker is recognizable.

$|S(k,n)|$ was also restored by filtering $\log\{|S(k,n)|^2\}$ as in
6.4, with p = 0.75 and a cutoff frequency $f_n = 0.5$ Hz. In this case
the noise was almost entirely removed, except near frequency
changes, and the resulting speech was quite intelligible. The
speech was more natural and less noisy than that obtained by
spectrum averaging. The processed signal is shown in Figure 25.

Filtering of $\log\{|S(k,n)|^2\}$ was also applied to removal of
electrical interference from several tape recorded signals obtained
in more realistic circumstances. The interfering signals were
almost completely removed in each case. However, the
intelligibility of the processed speech was sometimes poor due to
other distortions, such as resonances in the recording system and
tape saturation.




FIGURE  23 


Frequency response of filter for removal of periodic
interfering signals. The origin is at the center. The
passbands are along the $f_k$ axis for $|f_k| > \tau$. (The picture is
not to scale.)


FIGURE 24


Original  (undegraded)  speech  used  in  experiment  on 
removal  of  locally  periodic  interference  shown  in  Figure  25. 


FIGURE 25

Removal of locally periodic interference: (a) speech
degraded with square wave noise, (b) restored by spectrum
averaging, (c) restored by two dimensional filtering. All
three spectrograms have been scaled 6 dB/oct above 400 Hz.


CHAPTER  7 


CONCLUSIONS 

The experiments described here indicate that acoustic signal
processing in the time/frequency domain can offer significant
advantages over conventional one dimensional methods, particularly
when perception is the final measure of system performance. These
advantages are gained at the expense of computational complexity,
but the computing hardware exists (e.g., fast array processors) to
make time/frequency processing practical when the problem merits.
The chief disadvantage of time/frequency processing is the paucity
of concise mathematical results describing the effect of such
processing.

All  of  the  experiments  described  here  were  based  on  the 
magnitude  of  the  short-time  Fourier  transform.  Phase  information 
was  used  only  to  reconstruct  a time  signal.  In  at  least  two  cases, 
the  results  might  be  improved  using  phase  information. 

Removal of broad band background noise by thresholding as
described in Chapter 3 attenuates signal harmonics when their
magnitude remains too near the noise level. It is possible that
these harmonics could be detected in the phase. Consider $\angle S(k,n)$
as a function of time. Noise will have phase randomly distributed
between $-\pi$ and $\pi$, while coherent signals such as harmonics will
have nearly linear phase. The phase time derivative, therefore, should
have appreciable high frequency components only when the short-time
transform is locally dominated by noise. Techniques for detection
of FM signals in noise might even be applied to separate out the
component of the phase due to the signal and reject that due to
noise.
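
The phase criterion suggested above was not implemented in this work.
The following sketch only illustrates the observation it rests on, using
a hypothetical measure (phase_roughness) chosen for this example: the
average wrapped second difference of the phase of one channel of the
short-time transform over time.

```python
import numpy as np

def phase_roughness(z):
    """Mean absolute wrapped second difference of the phase of z over time."""
    # Phase increment from one time sample to the next, wrapped to (-pi, pi].
    dphi = np.angle(z[1:] * np.conj(z[:-1]))
    # Change of that increment, again wrapped to (-pi, pi].
    d2 = np.angle(np.exp(1j * np.diff(dphi)))
    return np.mean(np.abs(d2))
```

For a steady harmonic the phase is linear in time and the measure is
essentially zero; for a channel dominated by noise the wrapped second
difference is spread over the whole interval from $-\pi$ to $\pi$.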

Locally  periodic  interference  like  that  discussed  in  Chapter  6 
might  be  rejected  on  similar  criteria.  In  this  case,  however,  the 
portions  of  S(k,n)  dominated  by  the  interfering  signal  will  have 
linear  phase  while  those  due  mainly  to  speech  will  be  more  random. 
Attenuation  of  the  interfering  signal  based  on  phase  criteria  might 
produce  less  distortion  in  the  underlying  speech  than  the  methods  of 
Chapter  6. 

It was mentioned in the Introduction that the effectiveness of
the short-time Fourier transform as a perceptual model was limited
because it is constant bandwidth, while the ear is more nearly
constant Q. The potential advantages of a constant Q transform were
also observed experimentally. For example, in background noise
removal, narrower bandwidths could be used to better resolve low
frequency components, since the low frequency harmonics change
slowly with time. The same is true for analysis of any signal with
harmonic structure - increasing the bandwidth with frequency would
provide more nearly optimum resolution of the entire spectrum, since
high frequency harmonics change more rapidly. A reasonably
efficient algorithm for the constant Q equivalent of the short-time
transform would certainly improve some of the results obtained here,
and would make possible detailed studies of the perception of
acoustic stimuli similar to those which have been done for the
visual system [28].

One application of time/frequency processing not considered
here which appears especially promising is hearing aids for the
deaf. A large number of persons with impaired hearing suffer from
recruitment [29], a frequency dependent reduction in the dynamic
range of the auditory system. Recruitment is usually due to damage
to the hair cells or auditory nerve, and raises the threshold of
hearing in the region affected. The threshold of discomfort is not
changed, however, so that it is often not practical to preemphasize
the signal sufficiently to compensate for the hearing loss. The
results in Chapter 4 indicate that frequency dependent compression
based on the changes in threshold of hearing might be more
effective, particularly in improving speech intelligibility.




APPENDIX  A 

A LIMIT ON THE UNCERTAINTY PRODUCT FOR THE SHORT-TIME SPECTRUM

In this appendix, a relation is derived for the short-time
spectrum which is similar to the two dimensional uncertainty
principle in quantum mechanics. This result shows how the usual
uncertainty relation for signals and window functions carries over
to the short-time spectrum. We define

$$\Delta^2 t = \frac{\iint_{-\infty}^{\infty} (t-t_0)^2 \, |S(\omega,t)|^2 \, d\omega \, dt}{\iint_{-\infty}^{\infty} |S(\omega,t)|^2 \, d\omega \, dt} , \qquad (A.1)$$

$$\Delta^2 \omega = \frac{\iint_{-\infty}^{\infty} (\omega-\omega_0)^2 \, |S(\omega,t)|^2 \, d\omega \, dt}{\iint_{-\infty}^{\infty} |S(\omega,t)|^2 \, d\omega \, dt} . \qquad (A.2)$$

It will be shown in this appendix that

$$\Delta\omega \, \Delta t \ge 1 . \qquad (A.3)$$

The short-time Fourier transform is defined [1]

$$S(\omega,t) = \int_{-\infty}^{\infty} f(x) \, m(t-x) \, \exp(-i\omega x) \, dx \qquad (A.4)$$

$$= (2\pi)^{-1} \int_{-\infty}^{\infty} F(\lambda+\omega) \, M(\lambda) \, \exp(i\lambda t) \, d\lambda . \qquad (A.5)$$

The short-time spectrum is the squared magnitude of the short-time
Fourier transform. Equation A.5 is obtained by viewing A.4 as a
convolution, and using the convolution theorem to rewrite it as the
Fourier transform of a product. $F(\omega)$ and $M(\omega)$ are the
Fourier transforms of f(t) and m(t).


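First, the denominators of A.1 and A.2 are evaluated using
Parseval's theorem and the above definition. We will use the
notation $\|g\| = \int_{-\infty}^{\infty} |g(\rho)|^2 \, d\rho$.

$$\iint_{-\infty}^{\infty} |S(\omega,t)|^2 \, d\omega \, dt = 2\pi \iint_{-\infty}^{\infty} |f(x)|^2 \, |m(t-x)|^2 \, dx \, dt = 2\pi \, \|f\| \cdot \|m\| \qquad (A.6)$$

$$\iint_{-\infty}^{\infty} |S(\omega,t)|^2 \, d\omega \, dt = (2\pi)^{-1} \, \|F\| \cdot \|M\| \qquad (A.7)$$

We now evaluate $\omega_0$ and $t_0$, again using Parseval's theorem. For
simplicity, it is assumed that the window is real and centered in
time.

$$t_0 = \frac{\iint_{-\infty}^{\infty} t \, |S(\omega,t)|^2 \, d\omega \, dt}{2\pi \, \|f\| \cdot \|m\|} \qquad (A.8)$$

$$= \frac{\iint_{-\infty}^{\infty} t \, |f(x)|^2 \, |m(t-x)|^2 \, dx \, dt}{\|f\| \cdot \|m\|} \qquad (A.9)$$

$$= \frac{\iint_{-\infty}^{\infty} (t+x) \, |f(x)|^2 \, |m(t)|^2 \, dx \, dt}{\|f\| \cdot \|m\|} \qquad (A.10)$$

$$= \frac{\int_{-\infty}^{\infty} x \, |f(x)|^2 \, dx}{\|f\|} \qquad (A.11)$$

The first term in A.10 integrates to zero because the window is
centered.

$$\omega_0 = \frac{2\pi \iint_{-\infty}^{\infty} \omega \, |S(\omega,t)|^2 \, d\omega \, dt}{\|F\| \cdot \|M\|} \qquad (A.12)$$

$$= \frac{\iint_{-\infty}^{\infty} \omega \, |F(\lambda+\omega)|^2 \, |M(\lambda)|^2 \, d\omega \, d\lambda}{\|F\| \cdot \|M\|} \qquad (A.13)$$

$$= \frac{\iint_{-\infty}^{\infty} (\omega-\lambda) \, |F(\omega)|^2 \, |M(\lambda)|^2 \, d\omega \, d\lambda}{\|F\| \cdot \|M\|} \qquad (A.14)$$

$$= \frac{\int_{-\infty}^{\infty} \omega \, |F(\omega)|^2 \, d\omega}{\|F\|} \qquad (A.15)$$

We are now in a position to evaluate $\Delta^2 t$ and $\Delta^2 \omega$ in
terms of the signal and the window.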


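$$\Delta^2 t = \frac{\iint_{-\infty}^{\infty} (t-t_0)^2 \, |f(x)|^2 \, |m(t-x)|^2 \, dx \, dt}{\|f\| \cdot \|m\|} \qquad (A.16)$$

$$= \frac{\iint_{-\infty}^{\infty} (t+x-t_0)^2 \, |f(x)|^2 \, |m(t)|^2 \, dx \, dt}{\|f\| \cdot \|m\|} \qquad (A.17)$$

$$= \frac{\iint_{-\infty}^{\infty} \left[ t^2 + (x-t_0)^2 \right] |f(x)|^2 \, |m(t)|^2 \, dx \, dt}{\|f\| \cdot \|m\|} \qquad (A.18)$$

$$= \frac{\|(t-t_0) \cdot f\|}{\|f\|} + \frac{\|t \, m\|}{\|m\|} \qquad (A.19)$$

The cross term from A.17 integrates to zero - see A.11.

$$\Delta^2 \omega = \frac{\iint_{-\infty}^{\infty} (\omega-\omega_0)^2 \, |F(\lambda+\omega)|^2 \, |M(\lambda)|^2 \, d\omega \, d\lambda}{\|F\| \cdot \|M\|} \qquad (A.20)$$

$$= \frac{\iint_{-\infty}^{\infty} \left[ (\omega-\omega_0)^2 + \lambda^2 \right] |F(\omega)|^2 \, |M(\lambda)|^2 \, d\omega \, d\lambda}{\|F\| \cdot \|M\|} \qquad (A.21)$$

$$= \frac{\|(\omega-\omega_0) F\|}{\|F\|} + \frac{\|\omega M\|}{\|M\|} \qquad (A.22)$$

Introducing the notation

$$\Delta^2 g = \frac{\|(\rho-\rho_0) \, g\|}{\|g\|} ,$$

we can summarize the above results in a concise form.

$$\Delta^2 t = \Delta^2 f + \Delta^2 m \qquad (A.23)$$

$$\Delta^2 \omega = \Delta^2 F + \Delta^2 M \qquad (A.24)$$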

The fact that the time and frequency uncertainty of the
short-time spectrum can be decomposed into the sum of the
corresponding uncertainty for the signal and the window shows in a
simple manner the effect of the window on the short-time spectrum.

The usual objective in short-time frequency analysis is not to
minimize the product $\Delta^2\omega \, \Delta^2 t$, but to minimize the
portion due to the window. Equations A.23 and A.24 suggest the use of
smooth window functions like the gaussian which minimize
$\Delta^2 m \, \Delta^2 M$.

Inspection of A.18 and A.21 shows that minimizing the product
$\Delta^2\omega \, \Delta^2 t$ is equivalent to minimizing the uncertainty
product for the two dimensional, separable function $f(x) \cdot m(t)$ and
its Fourier transform. The result is a well known property of the two
dimensional Fourier transform, most often encountered in quantum
mechanics [16].

$$(\Delta^2 f + \Delta^2 m)(\Delta^2 F + \Delta^2 M) \ge 1 \qquad (A.25)$$

The equality in A.25 is satisfied when the signal and the window are
gaussians with the same variance.

$$f(t) = A \cdot \exp\left[ -(t-t_0)^2 / 2\sigma^2 \right] \cdot \exp(i\omega_0 t)$$

$$m(t) = B \cdot \exp\left[ -t^2 / 2\sigma^2 \right]$$



APPENDIX  B 


EXPERIMENTAL METHODS

B.1 Computational Information

The experiments described in this paper were performed using a
general purpose Digital Equipment Corporation PDP-10 computer, with
65,536 36-bit words of memory. The main programs were written in
FORTRAN, but made use of some assembly language (MACRO-10)
subroutines to speed up repetitive operations. All calculations
were done with floating point arithmetic. Magnetic disk storage was
used for the short-time Fourier transform and other arrays too large
to fit in memory.

No explicit effort was made to optimize the computational
efficiency of the algorithms used, as the main objective of this
research was to demonstrate the capabilities of time/frequency
processing. However, the computing times required do provide an
estimate of the computational effort inherent in this type of
process.

The times for floating point addition and multiplication on the
PDP-10 are 5 and 11 microseconds, respectively. A quite efficient
assembly language subroutine for the Fast Fourier Transform was
available, which required $42 K \log_2(K)$ microseconds to compute the
discrete Fourier transform of a K point complex sequence (e.g., a 2048
point sequence requires ~1 sec). This subroutine was used with a
FORTRAN implementation of Portnoff's method [15] for calculating the
discrete short-time Fourier transform and its inverse.
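
As a quick check of the quoted figure, the timing formula can be
evaluated directly. The function name is hypothetical; the constant of
42 microseconds is the one stated above.

```python
import math

def fft_time_seconds(K, us_per_unit=42.0):
    """Time for a K point complex FFT: 42 * K * log2(K) microseconds."""
    return us_per_unit * K * math.log2(K) * 1e-6
```

For K = 2048 this gives 42 x 2048 x 11 microseconds, or about 0.95
seconds, in agreement with the "~1 sec" quoted in the text.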

The short-time transform with a hanning window was sampled in
time at a rate of 8 samples per window length. This resulted in a
computing time of about 1 minute for the short-time transform in
polar form, and inverse transform of 10,000 samples (one second) of
signal. The time varies slightly (see equation 2.14) depending on
the number of frequency samples in the short-time transform. The
transform time is increased when the window is augmented with zeros
to minimize aliasing, as for two dimensional compression and
expansion in Chapter 5.

The local Wiener filtering and thresholding described in
Chapter 3 required 30 seconds to 1 minute per second of signal, in
addition to the transform time. The time for two dimensional
filtering varied widely, depending on the specific filter used. The
best times occurred when the filter impulse response was separable
and of relatively small extent, so that convolution could be
implemented directly. In these cases, filtering time was about 2
minutes per second of signal.

B.2  Recording  and  Playback  of  Signals 

The speech used in this research was digitized and stored on
magnetic disks in real time, using a B and K model 4144 condenser
microphone, low noise amplifiers, and a 15 bit analog to digital
converter. The speech was prefiltered at 4,000 Hz, and sampled at
10,000 samples per second. The recordings were made in a sound
isolated but acoustically live room. The overall signal to noise
ratio for the electronics at the time of recording was 80 dB.

The Caruso recording used in Chapter 3 was previously digitized
by T. G. Stockham, Jr.

Digitized signals were played back through a 16 bit digital to
analog converter, and filtered at 4,000 Hz. Critical listening was
done with Koss PRO-4A headphones. Signals could also be reproduced
with Bose 901 loudspeakers.

B.3 Spectrogram Displays

The spectrograms in this paper were produced with a precision
cathode ray tube display. The output signal is compensated to
account for the properties of the cathode ray tube phosphor and the
photographic film, so that the intensity of the light reflected from
the picture is proportional to the signal in the computer.

The spectrograms are 256 X 512 point displays of the magnitude
of the short-time Fourier transform. The magnitude is scaled to use
the full dynamic range of the film; in some cases, clipping of the
brightest features is allowed in order to improve the visibility of
fainter regions. The magnitude is normally scaled by 6 dB/oct above
400 Hz to remove the natural high frequency rolloff of speech.
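
One simple way to express the 6 dB/oct scaling is as a magnitude gain
proportional to frequency above the 400 Hz corner. The sharp corner and
the function name below are assumptions for illustration; the text does
not say how the scaling was actually realized.

```python
import numpy as np

def display_gain(freq_hz, corner_hz=400.0):
    """Magnitude gain: flat below corner_hz, rising 6 dB/octave above it."""
    f = np.asarray(freq_hz, dtype=float)
    return np.maximum(1.0, f / corner_hz)
```

Each doubling of frequency above 400 Hz doubles the gain, which is an
increase of 20 log10(2), or about 6 dB.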

With the exception of Chapter 4, all spectrograms in this paper
were made by reanalyzing the processed signal using the same window
as the process. Those in Chapter 4 show the modified short-time
spectrum directly, since a modified signal was not synthesized.

The speech shown in the spectrograms was that of a male with
relatively low pitch. For this reason, the spectrograms sometimes
show intermediate resolution of pitch where temporal structure and
harmonics are simultaneously visible. This is evident in the word
"man's" in Figure 5(b), for example. This had no significant effect
on processing since only low frequency components of the short-time
spectrum were modified.

REFERENCES


[1] C. J. Weinstein, "Short-Time Fourier Analysis and Its
Inverse," M.S. Thesis, Elec. Eng. Dep., Mass. Inst.
Technol., 1966.

[2] J. L. Flanagan, Speech Analysis, Synthesis, and Perception.
2nd ed., New York: Springer-Verlag, 1972.

[3] A. V. Oppenheim, "Speech spectrograms using the fast Fourier
transform," IEEE Spectrum, vol 7, pp 57-62, August 1970.

[4] A. M. Liberman, P. C. Delattre, and F. S. Cooper, "The role of
selected stimulus-variables in the perception of the unvoiced
stop consonants," Amer. J. Psychol., vol 65, pp 1-13,
October 1952.

[5] A. M. Liberman, P. C. Delattre, F. S. Cooper, and
L. J. Gerstman, "The role of consonant-vowel transitions in
the perception of the stop and nasal consonants," Psychol.
Monographs, vol 68, no 379, 1954.

[6] J. L. Flanagan and R. M. Golden, "Phase vocoder," Bell Syst.
Tech. J., vol 45, pp 1493-1509, November 1966.

[7] M. R. Schroeder, "Vocoders: analysis and synthesis of speech,
a review of 30 years of applied research," Proc. IEEE, vol
54, pp 720-734, May 1966.

[8] H. L. v. Helmholtz, On the Sensations of Tone. Translation of
the fourth German edition of 1877 by A. J. Ellis, New York:
Dover, 1954.

[9] J. V. Tobias, Foundations of Modern Auditory Theory. New York:
Academic Press, 1970.

[10] M. R. Schroeder, "Models of hearing," Proc. IEEE, vol 63,
pp 1332-1350, September 1975.

[11] G. v. Bekesy, Experiments in Hearing. New York: McGraw-Hill,
1960.

[12] N. Kiang, Discharge Patterns of Single Fibres in the Cat's
Auditory Nerve. Cambridge, Mass: M.I.T. Press, 1965.

[13] A. V. Oppenheim and R. W. Schafer, Digital Signal Processing.
Englewood Cliffs, N.J.: Prentice-Hall, 1975.

[14] E. A. Guillemin, Theory of Linear Physical Systems. New York:
Wiley, 1963.

[15] M. R. Portnoff, "Implementation of the digital phase vocoder
using the fast Fourier transform," to be published in IEEE
Trans. Acoust., Speech, Signal Processing.

[16] K. Gottfried, Quantum Mechanics. New York: Benjamin, 1966:
vol 1, p 214.

[17] H. J. Landau and H. O. Pollack, "Prolate spheroidal wave
functions, Fourier analysis and uncertainty - II," Bell Syst.
Tech. J., vol 40, pp 35-85, January 1961.

[18] R. B. Blackman and J. W. Tukey, The Measurement of Power
Spectra. New York: Dover, 1968.

[19] H. Y. Huang, "A Collection of Digital Window Functions," M.S.
Thesis, Comput. Sci. Dep., Univ. Utah, Salt Lake City,
1973.

[20] N. Wiener, The Extrapolation, Interpolation, and Smoothing of
Stationary Time Series with Engineering Applications. New
York: Wiley, 1949.

[21] M. B. Sachs and N. Kiang, "Two-tone inhibition in auditory
nerve fibers," J. Acoust. Soc. Amer., vol 43, pp 1120-1128,
May 1968.

[22] T. G. Stockham, Jr., "Image processing in the context of a
visual model," Proc. IEEE, pp 828-842, July 1972.

[23] A. V. Oppenheim, R. W. Schafer, and T. G. Stockham, Jr.,
"Nonlinear filtering of multiplied and convolved signals,"
Proc. IEEE, vol 56, pp 1264-1291, August 1968.

[24] J. P. Costas, "Coding with linear systems," Proc. IRE, vol 40,
pp 1101-1103, September 1952.

[25] N. S. Jayant and L. R. Rabiner, "The application of dither to
the quantization of speech signals," Bell Syst. Tech. J.,
vol 51, pp 1293-1304, July-August 1972.

[26] L. R. Rabiner and J. A. Johnson, "Perceptual evaluation of the
effects of dither on low bit rate PCM systems," Bell Syst.
Tech. J., vol 51, pp 1487-1494, Sept 1972.

[27] T. G. Stockham, Jr., T. M. Cannon, and R. B. Ingebretsen,
"Blind deconvolution through digital signal processing," Proc.
IEEE, vol 63, pp 678-692, April 1975.

[28] P. C. Baudelaire, "Digital Picture Processing and
Psychophysics: A Study of Brightness Perception," Ph.D.
Dissertation, Comput. Sci. Dep., Univ. Utah, Salt Lake
City, 1972.

[29] H. Davis and S. R. Silverman, Hearing and Deafness. New York:
Holt, Rinehart and Winston, 1970.




ACKNOWLEDGEMENTS 

I wish to thank the people who have helped me in the course of
the research leading to this dissertation. The late Professor
J. W. Keuffel first interested me in research on speech and hearing
because of his concern with the problems of the deaf. His support
was essential in allowing me to undertake this research. Professor
T. G. Stockham, Jr. supervised this dissertation. Much of the work
described here is based on his original research in speech and image
processing. Many others provided ideas, encouragement, and
technical support. Particular thanks are due to Dick Warnock and
Barden Smith, who kept the equipment running, to Mike Milochik, who
printed all of my spectrograms, and to George Randall and Brent
Baxter, who kept me honest - most of the time.


UNCLASSIFIED
SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)

REPORT DOCUMENTATION PAGE

1. REPORT NUMBER: UTEC-CSc-76-209
2. GOVT ACCESSION NO.:
3. RECIPIENT'S CATALOG NUMBER:
4. TITLE (and Subtitle): ACOUSTIC SIGNAL PROCESSING BASED ON THE
   SHORT-TIME SPECTRUM
5. TYPE OF REPORT & PERIOD COVERED: Technical Report
6. PERFORMING ORG. REPORT NUMBER:
7. AUTHOR(s): Michael Wayne Callahan
8. CONTRACT OR GRANT NUMBER(s): DAHC15-73-C-0363
9. PERFORMING ORGANIZATION NAME AND ADDRESS: Computer Science Department,
   University of Utah, Salt Lake City, Utah 84112
10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS: ARPA Order #2477
11. CONTROLLING OFFICE NAME AND ADDRESS: Defense Advanced Research Projects
    Agency, 1400 Wilson Blvd., Arlington, Virginia 22209
12. REPORT DATE: March 1976
13. NUMBER OF PAGES: 83
14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office):
15. SECURITY CLASS. (of this report): UNCLASSIFIED
15a. DECLASSIFICATION/DOWNGRADING SCHEDULE:
16. DISTRIBUTION STATEMENT (of this Report): This document has been approved
    for public release and sale; its distribution is unlimited.
17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different
    from Report):
18. SUPPLEMENTARY NOTES:
19. KEY WORDS (Continue on reverse side if necessary and identify by block
    number): acoustic signal processing, Fourier transform, frequency domain
    parameters, removal of background noise.
20. ABSTRACT (Continue on reverse side if necessary and identify by block
    number)

The frequency domain representation of a time signal afforded by
the Fourier transform is a powerful tool in acoustic signal processing.
The usefulness of this representation is rooted in the mechanisms of
sound production and perception. Many sources of sound exhibit
normal modes or natural frequencies of vibration, and can be described
concisely in the frequency domain. The human auditory system
performs frequency analysis early in the hearing process, so
perception is often best described by frequency domain parameters.



This dissertation investigates a new approach to acoustic signal
processing based on the short-time Fourier transform, a two dimensional
representation which shows the time and frequency structure
of sounds. This representation is appropriate for signals such as
speech and music, where the natural frequencies of the source change
and timing of these changes is important to perception. The principal
advantage of this approach is that the signal processing domain
is similar to the perceptual domain, so that signal modifications
can be related to perceptual criteria.

The mathematical basis for this type of processing is developed,
and four examples are described: removal of broad band background
noise, isolation of perceptually important speech features, dynamic
range compression and expansion, and removal of locally periodic
signals.


